Reader Comments


False and more false than ever

Posted by MKary on 08 Dec 2014 at 20:07 GMT

My two previous comments here consist of an extended discussion[1] and a summary[2] of the many faults of Olivier and Walter's reanalysis[3] of Walker's data.[4] Olivier and Walter's attempts to extricate themselves now spread across at least three posts in two journals. Here I largely consolidate my replies, leaving only a little more to one of the previous threads.[2]

Olivier's many parries are sometimes tangentially related, and too often irrelevant, misleading, or false. To further convince the reader, Olivier throws in a bunch of ad hominems.


1. Olivier states "Cohen gives recommendations of 0.2, 0.5 and 0.8 for small, medium and large effect sizes respectively. He considered any effect size less than small to be "trivial"." Elsewhere Olivier and Walter claim an effect size of d = 0.12 "is trivial by Cohen’s definition".[5]

In my previous comments[1, 2] I showed that Olivier and Walter have on numerous occasions backed up their false claims with false citations, i.e. to works that say no such thing, or even the contrary. And indeed both these citations of Cohen[6] by Olivier, alone or with Walter, are again false: Cohen did not define or specify any effect size to be trivial.[6] He proposed only that d = 0.2 was "small" but not trivial.[6] And what numerical example did Cohen give to first introduce his concept of d?[7] None other than d = 0.1, with nary a disparaging remark-- but with the corresponding sample sizes extensively tabulated.


2. In my previous comments[1, 2] I concluded that Olivier and Walter's reanalysis is constructed around the false claim that increasing the sample size increases the probability of Type I errors. Elsewhere Olivier and Walter have tried to extricate themselves from this embarrassing conclusion, arguing that (a) "increasing power when computing sample size leads to an increase in the probability of a type I error"; and (b) this, not the former, is the claim they base their reanalysis on.[5] To no avail: in fact this alternate claim is false as well, if it means anything at all.

"Increasing power when computing sample size" means sample size is a varying output (result), while power is a variable input. Since the probability of Type I errors is claimed to thereby increase, it too is a varying output. But then no result is forthcoming, because Olivier and Walter are positing but one equation[5] for the two items that are to be calculated (i.e., one equation in the two unknowns), sample size and Type I error probability. Thus their counter-assertion is also false: the circumstance they propose leaves the resulting Type I error probability undetermined, free to be chosen as seen fit. And indeed Olivier and Walter saw fit to choose exactly the same level for it as Walker: alpha = 0.05.[3, 4]

So in the end, which false assertion is Olivier and Walter's reanalysis[3] constructed around? The answer is to be found in the method they used to remedy the supposed problem of too much power.

If the problem really were too much power leading to a too-high Type I error level, then what Olivier and Walter would have done is simply to reduce the power by reducing alpha from its conventional level of 0.05-- the one used by Walker-- to some more stringent one. They did not: instead, they reduced the power by decreasing the sample size-- and left the pre-specified Type I error level just as it was.

Thus as I already said in my earlier reply: "of course, for a fixed sample size, power can be increased by using a less stringent criterion for significance. But that is not what Walker did, it is not what Olivier and Walter are objecting to, and the remedy they propose is not to use a more stringent criterion for significance, but to reduce the sample size post hoc via their resampling scheme."[1]

In short: Olivier and Walter's reanalysis is not merely based on one or the other of the two false claims, but both of them.


3. Olivier correctly states that the p-value is not alpha (the level for p to be described as statistically significant or not)-- the latter being a term I did not use. He also states that alpha is the Type I error probability and "alpha must be chosen before any analysis is performed and usually fixed at 5%. This is a fundamental concept taught in any introductory statistics course."

What Olivier does not mention is that at a more advanced level-- or in introductory courses taught at more advanced institutions-- one understands that things are not so simple. First, alpha is a decision criterion: that (if respected) its value should equal the Type I error probability is a theorem, not a definition (there can be no Type I errors without the corresponding decisions). And as a theorem, it holds true only under certain circumstances. Thus the hypothesis here that "there is no true difference in the population",[3] i.e. with and without the treatment, is not just a null hypothesis but also a nil hypothesis (Cohen 1994) concerning the real world.[8] And as Cohen, and others including Tukey and Schmidt, have explained, for this class the probability of a Type I error is zero, regardless of one's choice of alpha. Thus for this class, the concept of alpha is at best strained.

But in all cases evidentiary p-values, being not decision criteria but results given a hypothetical, always make sense; and as hypotheticals in themselves, they do always specify either real or merely hypothetical or nominal Type I error probabilities: the p-value is always "the probability that the null hypothesis would be falsely rejected when true, were the data just decisive against the hypothesis." (Cox 1982)[9] And as I stated previously,[1] the largest p-value reported by Walker,[4] related to helmet wearing, and described by him as significant (i.e., as just or more decisive against the null hypothesis), was p = 0.007-- much smaller than any usual decision criterion.

Thus if one wants to talk in terms of the relation between Type I error rates and alpha, the problem Olivier purports to be concerned with can also be posed this way: did Walker specify a decision criterion (alpha level) that would give a problematic hypothetical or nominal Type I error probability if followed? And more importantly in the present context: did Olivier and Walter even try to do anything to mitigate this supposed problem?

The answer to both questions is that Walker specified a thoroughly conventional alpha = 0.05-- and that Olivier and Walter did exactly the same. In other words, although Olivier and Walter claim to have done their reanalysis to address an overly high Type I error probability, even in their own terms they did nothing, from the get-go, to mitigate this supposed problem.

To close this section on a related note: Olivier does correctly state that hypothesis testing has been compared to trial in a court of law. But he then tells us this means one should favour avoiding Type I errors over Type II. He does this to help his strange argument that Walker's study is dangerously (as opposed to uneconomically) overpowered. Contrast Olivier's position with that of the inventors of the concepts of Type I and Type II errors, Neyman and Pearson[10]:

"These two sources of error [Type I and Type II] can rarely be eliminated completely; in some cases it will be more important to avoid the first, in others the second. We are reminded of the old problem considered by Laplace of the number of votes in a court of judges that should be needed to convict a prisoner. Is it more serious to convict an innocent man or to acquit a guilty? That will depend upon the consequences of the error; is the punishment death or fine; what is the danger to the community of released criminals; what are the current ethical views on punishment? From the point of view of mathematical theory all that we can do is show how the risk of the errors may be controlled and minimised. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator."


4. Olivier writes:
"You state 'In fact, large sample sizes never contribute to spurious results.' Here are a few references that state otherwise.[11, 12] A simple example is to consider the t-test. [This example shows that] any observed difference can be statistically significant at any alpha level. It is therefore important to compute sample size with this in mind, not overly small and not overly large."

Suppose the nonsense expressed in that last sentence were true: that a sample size could be so large as to give problematic results. How large would that be? If there were such a size, then the largest conceivable sample must be at least that large. The largest conceivable sample is simply the entire population. But when we know the entire population, there is no statistical inference involved, therefore no statistical errors of any kind are possible, and so no results are problematic and all are true. Thus indeed Olivier's claim is nonsense.

Olivier could have avoided making such an embarrassing claim if he had just actually read any of the three references he has cited to support it (Olivier has now quietly left out the one he and Walter[3] formerly used, since I showed[1] it said no such thing). In fact one of them[12] explains the falsity of Olivier's claim so well that it is worth quoting in detail. Since it is from the "Statistics Help" section of the Minitab blog, it consists of a question from a perplexed blog follower and an answer from the blog author.[12] The question from the perplexed blog follower mirrors Olivier's claim above identically; I repeat the claim here for the reader's convenience, so that the two can be seen immediately adjacent:

[Olivier:]
"A simple example is to consider the t-test. [This examples shows that] any observed difference can be statistically significant at any alpha level. It is therefore important to compute sample size with this in mind, not overly small and not overly large."

[Perplexed blog follower in the citation Olivier gave[12] to support his view:]

"I am really stuck how to choose a sample size. Large sample size seems to reject the null hypothesis whereas a small sample size accepts the null. Is there a method to choose correct sample size?"

[Answer from the author of and in the citation Olivier gave[12] to support his view:]

"I'm not sure what you mean by "choosing the correct sample size." For a given amount of variation in the data, a larger sample will give the test more power to detect a significant difference in means, if one exists. Remember, what you're really asking or finding out when you perform a t-test on the difference in means is: "Do I have ENOUGH EVIDENCE to conclude that the difference between these two means is statistically significant?". So all things being equal, a larger sample will be more likely to provide that evidence than a smaller sample. The difference between the means should be roughly the same with a larger or smaller sample (ASSUMING that your sample is representative of the population in either case). So really, the t- test is just telling you whether you have enough evidence to make an inference about the entire population. A larger sample will be more likely to allow you to do that. It's not a bad thing (to reject the null with a large sample). You just need to go one step further and evaluate whether the statistically significant difference itself has any practical ramifications--and that doesn't depend on your sample size (or your p- value result)."

Or as I said previously-- and unlike Olivier, to quote so as to give the complete thought: "In fact, large sample sizes never contribute to spurious results. They can only confuse those who do not understand the difference between statistical and other types of significance. Olivier and Walter have confused the finding of a difference that putatively does not make a difference, with a Type I error."[1]
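
The point is also easy to check by simulation. The sketch below (Python; the populations and sample sizes are arbitrary illustrations, not Walker's data) draws both groups from one and the same population, so that the null hypothesis is true, and shows that the proportion of falsely "significant" results stays near alpha however large the sample.

```python
# Sketch: with a true null hypothesis, the false-positive rate stays
# near alpha regardless of sample size. All settings are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims = 0.05, 2000

for n in (20, 200, 2000):
    # Both groups are drawn from the same population: the null is true.
    a = rng.normal(0.0, 1.0, size=(n_sims, n))
    b = rng.normal(0.0, 1.0, size=(n_sims, n))
    _, p = stats.ttest_ind(a, b, axis=1)
    print(f"n = {n:>4}  proportion of false positives ~ {np.mean(p < alpha):.3f}")
```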


5. The reader familiar with Olivier's style of argumentation may be struck by the fact that in the previous item, Olivier cited a non-peer reviewed source (in fact a blog entry) to (falsely) back up a (false) claim that is central to the rationale of his reanalysis.[3]

This is striking because in concert with various colleagues, Olivier readily attacks anyone for, in disagreeing with them, citing anything not peer reviewed-- even when done in a non-peer reviewed forum, such as a comments section. The reader need only look to this very forum for examples.

Yet the previous item is not the only occasion where Olivier cites non-peer reviewed sources to justify crucial principles of his reanalysis.[1] Nor is it the only occasion where Olivier cites sources, peer-reviewed or not, that do not make the claims he attributes to them, or even make a contrary claim (see e.g. item 1 here, and my two previous response threads[1, 2]). Thus in a previous response[1] I noted that Olivier and Walter cite non-peer reviewed advocacy documents to make a safety claim fundamental to their analysis, and that some of these sources actually contradict that claim (that there is a dichotomy between safe and unsafe passing at 1 m). I also noted the manifest hypocrisy of such manoeuvring on their part.

Thus to dispense with another of Olivier's attempts at rebuttal:

Olivier states "you have cited your response to justify the claim 'a recent and elaborate statistical reanalysis is constructed around the false claim that increasing the sample size increases the risk of type I errors'. I find this odd considering one of your criticisms is we cite non-peer reviewed work." Elsewhere Olivier and Walter resort to the same objection, although without mentioning their own fault.[5]

I cited my response to elaborate, not justify, my claim. When an author cites an external document by the self-same author to do that, it functions as a footnote or appendix to the citing article, the choice between the three formats being a matter of convenience.

By contrast Olivier and Walter cite multiple non-peer reviewed political advocacy documents by authors other than themselves, not to elaborate an argument of their own, but rather to use someone else's political compromises as a truth about road safety, and so to justify the method of reanalysis they employ.


6. Olivier states "you assert our study “confirms that Walker did his own analysis correctly”. This is not true. In our paper, we argue that piece-meal analysis, as Walker did, does not adequately address..."

I said "his own analysis"-- not someone else's. Indeed Olivier and Walter explicitly state in their article that they did reproduce and confirm Walker's results, and they did not raise any issue of miscalculation.

Now Olivier complains he could not reproduce Walker's power calculation: with various other settings, for a power of 98%, he gets n = 2251, not the n = 2259 given by Walker. He also complains that he does get 98% power if he works backwards from n = 2259, and thereby insinuates some sort of misconduct on Walker's part, because Walker stated his power analysis was done a priori, not post hoc. In his response to my other previous comment[2] Olivier goes beyond insinuation, stating "He [Walker] may have indicated this was done a priori, but I do suspect he computed power post hoc after collecting a sizeable sample".

As usual, what Olivier does not tell us is more important than what he does. What Olivier does not tell us is that... one also gets 98% power working backwards for any n from 2159 to 2261 inclusive. So what is it then: what is so special about n = 2259 that Walker supposedly chose it, surreptitiously, over all 102 others? This is especially so since the sample size actually used by Walker was not any of these, such that the magic number 2259 plays no role of any kind in Walker's analysis.
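
To see why "working backwards" from a rounded power figure cannot single out one sample size, consider this minimal sketch (Python/statsmodels; the effect size and alpha are placeholders chosen only for illustration and are not Walker's settings). A whole band of sample sizes yields a power that rounds to 0.98.

```python
# Sketch: many different sample sizes give a power that rounds to 98%.
# The inputs (d = 0.15, alpha = 0.05) are illustrative placeholders.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d, alpha = 0.15, 0.05

band = [n for n in range(1300, 1601)
        if round(analysis.power(effect_size=d, nobs1=n, alpha=alpha,
                                ratio=1.0, alternative='two-sided'), 2) == 0.98]
print(f"{len(band)} sample sizes round to 98% power: "
      f"n = {band[0]} to {band[-1]}")
```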

As to why Walker used n = 2355 in practice: welcome to the real world of experimental science. Once n = 2251 or 2259 was reached, should Walker have just shut down everything and hitchhiked home? Moreover some of the data, including some from the video analysis, should be expected to be unusable, hence the desirability of some cushion.

In short, Olivier is grasping at straws. There are numerous legitimate possible explanations for the difference, including the numerous fixes to G*Power over the intervening years or a simple typo, there are no consequences to the difference, and there are no legitimate grounds for Olivier to cast aspersions.


7. Olivier and Walter state: "[Walker's] is an overpowered study far above the usual convention of 80% power being adequate to detect an effect of any size". Here Olivier further states "I have never come across a study that computed sample size powered at 98%."

Olivier expects us to take these statements as an indictment of Walker's study, rather than of his own inexperience or worse:

(i) The default setting in G*Power is for a power of 95%.

(ii) An internet search for "98% power to detect" easily turns up e.g. these two studies:

Mullins N et al. Investigation of blood mRNA biomarkers for suicidality in an independent sample. Translational Psychiatry (2014) 4, e474; doi:10.1038/tp.2014.112
http://www.nature.com/tp/...

Arnold ME et al. Sensitivity of environmental sampling methods for detecting Salmonella Enteritidis in commercial laying flocks relative to the within-flock prevalence. Epidemiology and Infection 2010;138(3):330-9. doi: 10.1017/S0950268809990598.
http://www.ncbi.nlm.nih.g...

(iii) Olivier relies frequently (albeit falsely: see item 1) on Cohen's work on sample size and power. He should know then that Cohen extensively tabulated sample size requirements, not for 98% but for 99% power, with nary a disparaging remark.[7]

(iv) The well-known "Statistics Guide for Research Grant Applicants",[13] authored by a number of renowned medical statisticians, says this:

"Power is typically set at 80%, 90% or 95%. Power should not be less than 80%. If it is very important that the study does not miss a real effect, then a power of 90% or more should be applied."

Contrary to Olivier, these authors wisely give no warning whatsoever of a resulting increased risk of Type I errors, or of spurious results of any kind.


8. Olivier states: "Walker’s most recent paper on this topic supports the notion the statistically significant helmet wearing effect was spurious".

As usual, what Olivier does not tell us is more important than what he does. What he does not tell us is that this experiment could neither confirm nor refute the helmet wearing effect, because no two groups differed only in helmet wearing in their experimental treatment.[14]


9. Olivier states: "increasing sample size cannot “correct” for bias. Collecting more and more biased data does not result in a representative sample. I’m confused by your quotes “good quality prior studies”. We never stated that anywhere and I find it troubling that you attribute that comment to us."

I am "confused" by Olivier's quotes " “correct” for bias ". I never stated that anywhere and I find it troubling that Olivier attributes that comment to me.

I am also "confused" by Olivier's claim that I attributed the phrase "good quality prior studies" to him. On the contrary I explicitly attributed it to Ewald (2006),[15] right in the preceding sentence. The next sentence, where Olivier and Walter's names occur in conjunction with that phrase, describes how their citation of Ewald to justify their claim is inappropriate. I find it troubling that Olivier attributes such an attribution to me.


10. Olivier: "You state “there is unequivocal experimental demonstration of risk compensation when wearing safety equipment, including bicycle helmets” followed by three citations. The first two don’t seem to be relevant here as the participants were not cycling."

This is a strange complaint, considering that Olivier and Walter diminish risk compensation theory itself, not just in its application to bicycling.


11. Olivier: "Risk compensation is directional – it’s about behaviour change when putting safety equipment on and not taking it off."

Besides being strange, this claim is completely at odds with both the spirit and substance of either standard risk compensation or risk homeostasis theory. Either Olivier gets it from his imagination, or from reading only secondary sources that give only the part of the idea that is relevant to their context, this being almost always the introduction of some putative safety measure. Better to get it straight from the mouths of some of its horses, or even from a famous review article:

"Risk compensation – the proposition that a person’s perception of risk influences their risk-taking behaviour". (Adams 2013)[16]

"[Risk homeostasis] may be compared to a thermostat. The thermostat determines the action of the heating system and thus the temperature in the house, while the temperature in the house [level of risk] controls the actions of the thermostat. Fluctuations may occur, but the time averaged temperature remains the same, unless the desired (target) temperature [target level of risk] is set at a different level." (Wilde 2002)[17]

"We all change our behavior in response to some changes in perceived injury risk. Most obviously, we may take additional precautions if we believe our risk has increased. When roads and sidewalks are icy, we may walk more carefully for fear of falling and we may drive more slowly to be sure that we can stop safely. But it is not at all obvious that we change our behavior in response to every increase or decrease in risk. The heart of the risk compensation debate lies in determining which risk changes will produce compensating behavioral change." (Hedlund 2000)[18]


12. Olivier: "In 2001, the Thompsons and Rivara called for a systematic review of the evidence around risk compensation. Thirteen years on, no such review exists. Why?"

Instead of asking rhetorical questions, why doesn't Olivier simply read the responses by Adams and Hillman that accompanied that commentary by the Thompsons and Rivara, where his question had already been answered, twice, 13 years before he asked it?

http://injuryprevention.b... [19]

As these authors explain, in fact Olivier's question was also answered a third time, 14 years before he asked it, in and by Hedlund's (non-systematic) review.[18]

To put it more simply: calling for a systematic review of the evidence around risk compensation is rather like calling for a systematic review of the evidence around Mendelian genetics, to take an arbitrary example. A century and a half after its introduction, no such review exists. Why? How can we dare nonchalantly use the theory without one?


13. Olivier: "The differences in passing distance are quite similar whether wearing a helmet or not until after 2m."

Elsewhere Olivier states: "The real differences in passing distance when wearing or not wearing a helmet occur for distances greater than 1.5m."[20]


14. Olivier: "Can you provide evidence, not hypotheticals, that a difference in 7.2cm when the vehicle is already beyond 2m is a problem?"

Suppose the effect size were one foot. That ought to trigger a red alert, even if it were agreed that we wouldn't have evidence that the finding applied to other societies, built environments, driving or riding cultures, or even later years given the time trends typical of road safety. But would there be "evidence, not hypotheticals," that a difference of one foot is a problem anyway? When the vehicle is already beyond 2 m? Or for any other passing event analyzed by Walker?

Exactly zero of the passing events analyzed by Walker were at a distance of less than 0.3048 m (= 1 ft), and moreover the minimum of these was at 0.394 m. Consequently the safety outcome would have been exactly the same as what was actually the case: two collisions while already wearing a helmet-- leaving the safety implication purely "hypothetical".

Olivier seems not to realize that statistical probability calculations and effect estimates, including all of those in his article, are all hypotheticals: they are all based on unprovable hypotheses such as normality of the underlying distributions, independence, nullity, or any of many more. The hypotheses used in my examples are the laws of motion, the dimensions of bicycle and body parts, and circumstances that are known to occur. Thus:

Seven millimetres (an effect size for certain circumstances calculated by Olivier elsewhere) is approximately one-third the diameter of handlebar tubing. Seven centimetres is a vital fraction of the diameter of a human limb, skull, or torso. If, after an initial set-up at whatever passing distance, an unexpected excursion of driver or rider (e.g. swerving to avoid a suddenly opened door) closes the gap to 0, and helmet wearing brings the rider either of those distances closer still, the contact goes from none, to brushing, to one with sufficient mechanical purchase to be disastrous.

In other words, finding out what effect size is practically significant or not is the business of the scientist, engineer or clinician, not the statistician.


15. Olivier: "You indicate we do not understand the difference between “statistical and other types of significance”."

Olivier and Walter have offered nothing and have nothing to offer for the real problem: is there practical significance to the effect size that was found? While phrasing their article such that the reader will think this is the question they address, in fact their entire analysis concerns only statistical significance, which, once reduced by their methods, they present as if it were practical significance.

Thus by now most readers will find that my claim, that they do not understand the difference between statistical and practical significance, is far more favourable to them than its negation.


References

1 . Kary M. Fundamental misconceptions of safety and of statistics. http://www.plosone.org/an...
2 . Kary M. A selection of corrections. http://www.plosone.org/an...
3 . Olivier J, Walter SR (2013) Bicycle helmet wearing is not associated with close motor vehicle passing: a re-analysis of Walker, 2007. PLoS ONE 8(9): e75424. doi:10.1371/journal.pone.0075424
4 . Walker I (2007). Drivers overtaking bicyclists: Objective data on the effects of riding position, helmet use, vehicle type and apparent gender. Accident Analysis and Prevention 39:417–425. doi:10.1016/j.aap.2006.08.010
5 . Olivier J, Walter SR. Too much statistical power can lead to false conclusions: a response to ‘Unsuitability of the epidemiological approach to bicycle transportation injuries and traffic engineering problems’ by Kary. Inj Prev Published Online First: 30 October 2014 doi:10.1136/injuryprev-2014-041452
6 . Cohen J. A power primer. Psychological Bulletin 1992;112:155–159.
7 . Cohen J. Statistical Power Analysis for the Behavioral Sciences. Academic Press, Orlando, Florida, 1977 (Revised Edn.)
8 . Cohen J. The earth is round (p < .05). American Psychologist 1994;49:997-1003.
9 . Cox DR (1982). Statistical significance tests. Br J Clin Pharmac 14:325-331.
10 . Neyman J, Pearson ES (1933). On the problem of the most efficient tests of statistical hypotheses. Phil Trans Roy Soc Lon Ser A 231:289-337.
11 . Helberg C (1996). Pitfalls of data analysis. Practical Assessment, Research & Evaluation 5(5). http://PAREonline.net/get...
12 . Runckel P. Large samples: too much of a good thing? The Minitab Blog, June 4 2012. http://blog.minitab.com/b...
13 . Bland JM et al. Statistics Guide for Research Grant Applicants. http://www-users.york.ac....
14 . Walker I, Garrard I, Jowitt F. The influence of a bicycle commuter’s appearance on drivers’ overtaking proximities: An on-road test of bicyclist stereotypes, high-visibility clothing and safety aids in the United Kingdom. Accident Analysis and Prevention (2013), http://dx.doi.org/10.1016...
15 . Ewald B. Post hoc choice of cut points introduced bias to diagnostic research. J Clin Epid 2006;59:798-801. doi: 10.1016/j.jclinepi.2005.11.025
16 . Adams J. Pater knows best. http://www.john-adams.co....
17 . Wilde GJS. Does risk homoeostasis theory have implications for road safety. BMJ 2002;324:1149–52.
18 . Hedlund J. Risky business: safety regulations, risk compensation, and individual behavior. Inj Prev 2000;6:82-89. doi:10.1136/ip.6.2.82
19 . Adams J, Hillman M. The risk compensation theory and bicycle helmets; and Response. http://injuryprevention.b...
20 . Pless IB. Another antihelmet legislation argument bites the dust. Inj Prev 2013;19:440–441. doi:10.1136/injuryprev-2013-041049

No competing interests declared.

RE: False and more false than ever

jakeolivier replied to MKary on 09 Dec 2014 at 06:36 GMT

M Kary,

During my time researching cycling safety, and more specifically bicycle helmets, I have come across many sacred cows of the anti-helmet movement. Walker’s overtaking study is one of them and apparently one close to your heart. I don’t understand the motivation to viciously defend that position in light of contradicting evidence and I suppose I never will.

As I stated in a previous response, much of what you’ve written is an attack. For example, you state “Olivier and Walter have tried to extricate themselves from this embarrassing conclusion”. I’m in no way embarrassed by our research.

You also like using provocative terms and phrases like “misconceptions”, “corrections”, “more false than ever”, “embarrassing”, and “nonsense”. None of these statements are backed by any evidence. In the end, what you have written is inappropriate for this or any other journal.

In addition to our letter you mention, I’ve posted a longer response to your Injury Prevention commentary on my blog (which you haven’t cited).

http://injurystats.wordpr...
http://injuryprevention.b...

Below are responses to your numbered comments. As with your previous comments, I will demonstrate they are without merit.

1. Here’s what Cohen says about small, medium and large effect sizes in his 1992 Psychological Bulletin paper.

“To convey the meaning of any given ES index, it is necessary to have some idea of its scale. To this end, I have proposed as conventions or operational definitions small, medium, and large values for each that are at least approximately consistent across the different ES indexes. My intent was that medium ES represent an effect likely to be visible to the naked eye of a careful observer. (It has since been noted in effect size surveys that it approximates the average size of observed effects in various fields.) I set small ES to be noticeably smaller than medium but not so small as to be trivial, and I set large ES to be the same distance above medium as small was below it. Although the definitions were made subjectively, with some early minor adjustments, these conventions have been fixed since the 1977 edition of SPABS and have come into general use. Table I contains these values for the tests considered here.”

Why exactly can the reader not interpret an effect size smaller than d=0.2 as trivial? Cohen clearly indicates a small effect size was chosen large enough to not be trivial. Is there some other effect size category between small and trivial that I and others in the research community are unaware of?

Since you don’t seem to agree with Cohen, how about Ian Walker himself? Below is a non-peer-reviewed blog entry by Walker, so I’m assuming you’ll criticise me for the citation.

http://staff.bath.ac.uk/p...

“However, there are problems with this process. As we have discussed, there is the problem that we spend all our time worrying about the completely arbitrary .05 alpha value, such that p=.04999 is a publishable finding but p=.05001 is not. But there is also another problem: even the most trivial effect (a tiny difference between two groups' means, or a miniscule correlation) will become statistically significant if you test enough people. If a small difference between two groups' means is not significant when I test 100 people, should I suddenly get excited about exactly the same difference if, after testing 1000 people, I find it is now significant? The answer is probably no -- if it was a trivial effect with 100 people it's still trivial with 1000: we don't really care if something makes just a 1% difference to performance, even if it is statistically significant. So what is needed is not just a system of null hypothesis testing but also a system for telling us precisely how large the effects we see in our data really are. This is where effect-size measures come in.”

Later on, Walker does indicate values of Cohen’s d less than 0.2 are trivial.

“Cohen suggested that d=0.2 be considered a 'small' effect size, 0.5 represents a 'medium' effect size and 0.8 a 'large' effect size. This means that if two groups' means don't differ by 0.2 standard deviations or more, the difference is trivial, even if it is statistically significant.”

Need I remind you that the helmet effect using Walker’s results is d=0.12 and in the trivial range (as defined by both Cohen and Walker)?

2. As we note in our Injury Prevention letter, sample size is a function of effect size, type I error rate and power. The equation can also be re-arranged so that the type I error rate is a function of sample size, effect size and power. If sample size and effect size are held fixed, the type I error rate increases as power increases.

Isn’t your background in mathematics? I’m not sure why this needs further explanation.
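
For the record, the re-arrangement described above can be written out directly. The sketch below (Python/statsmodels; d = 0.12 and n = 1000 per group are illustrative numbers only, not taken from Walker's study) holds effect size and sample size fixed and solves the same relation for the alpha implied by each level of power.

```python
# Sketch of the re-arrangement: hold effect size and sample size fixed
# and solve the power relation for alpha. All numbers are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d, n = 0.12, 1000  # illustrative effect size and per-group sample size

for power in (0.50, 0.80, 0.90, 0.98):
    implied_alpha = analysis.solve_power(effect_size=d, nobs1=n,
                                         alpha=None, power=power,
                                         ratio=1.0, alternative='two-sided')
    print(f"power = {power:.2f}  ->  implied alpha ~ {implied_alpha:.3f}")
```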

3. As stated previously, the p-value is a function of sample size. So, any observed difference can be statistically significant given a large enough sample size. The difference in passing distance could be a micron and a large enough sample size would result in a p-value less than a pre-chosen level for alpha. Would you then argue a micron is an important difference here?
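
For concreteness, this dependence of the p-value on sample size can be sketched as follows (Python/scipy; the 0.5 cm difference and 40 cm standard deviation are invented for illustration and are not Walker's figures). The p-value of the same fixed difference falls as n grows, while the standardised effect size does not change.

```python
# Sketch: for a fixed observed difference, the p-value shrinks as n grows,
# while Cohen's d stays the same. The numbers are invented for illustration.
from scipy.stats import ttest_ind_from_stats

diff_cm, sd_cm = 0.5, 40.0  # fixed observed difference and spread
for n in (100, 1_000, 10_000, 1_000_000):
    res = ttest_ind_from_stats(mean1=diff_cm, std1=sd_cm, nobs1=n,
                               mean2=0.0, std2=sd_cm, nobs2=n)
    d = diff_cm / sd_cm  # unchanged by sample size
    print(f"n = {n:>9,}  d = {d:.4f}  p = {res.pvalue:.3g}")
```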

4. Your entire spiel is dependent on the observed effect size being important. We’ve demonstrated here and elsewhere the observed effect sizes for helmet wearing using Walker’s data are trivial, at least from a safety perspective. We’ve also demonstrated many times now that any effect size, whether important or trivial, can be statistically significant given a large enough sample size.

I take issue with you calling my comment “nonsense”. A lot of effort is put into determining a sample size, in clinical trials and elsewhere, that is not overly small or overly large. You seem to justify your position by arguing that more observations imply we know more about the population. If so, then Walker’s data is pretty close to the population of Ian Walker cycling to and from work (Salisbury and Bristol, UK) over a two month period in 2006 and not the cycling population overall.

5. This point is just an ad hominem attack. You have stated “peer review as such is no great badge of honour” and your contributions to our knowledge of cycling safety are almost exclusively non-peer reviewed comments on the web. Why is that?

Self-publishing is no measure of rigour and that is why I’ve called you out. It is true we cite several non-peer reviewed documents in our paper. It includes a news article citing Ian Walker’s paper, cycling advocates calling for minimum passing distances, Ian Walker’s website where I downloaded the data and population density information for Bristol and Salisbury. Exactly which of those are inappropriate for this study? Note I used these websites for the information contained in them and not to highlight someone’s opinion.

6. Can you reproduce Walker’s sample size calculations and results? I sure couldn’t. Besides the sample size issue, I also could not identify the 35 observations removed “after being identified as extreme outliers in SPSS exploration.” You’re very happy to nit-pick our study (and Canadian studies in your Injury Prevention letter); however, you seem content to give Walker a pass on his paper and analysis. Why?

7. It is true that G*Power uses 95% as the default and I always change it to something else. The default settings across statistical software are by no means standard. I published a paper earlier this year discussing some of those issues.

http://www.sciencedirect....

I searched PubMed using the key word “98% power” and repeated it for 80%, 85%, 90% and 95% power. Here are the numbers of documents identified.

Power Articles
80% 863
85% 54
90% 362
95% 52
98% 5

The use of 98% power seems to be quite rare in published research. Note one of the five articles was not in reference to 98% power from a statistical perspective, so there are no more than four articles supportive of your hypothesis.

8. You claim I have little to say about Walker’s more recent study. Here’s the actual quote

“Walker’s most recent paper on this topic supports the notion the statistically significant helmet wearing effect was spurious. Cyclists wore seven different types of outfits with six wearing helmets and one not. The “casual” cyclist had a mean overtaking distance of 117.61 while the range of the means for all types was 114.05 to 122.12. The best result was to dress like a police officer (with a helmet on).”

The average passing distance for the cyclist type without a helmet was near dead center of the averages of the other types. How exactly is that not clear to you?

9. In a previous comment you seem to argue larger sample sizes can correct for bias. Is that not true?

10. I included the other citations because risk compensation has been argued for many situations and not just cycling. Is that not true?

11. The risk compensation hypothesis is clearly directional as it relates to arguments against helmets. The argument is about whether to promote or legislate helmet usage (or get unhelmeted cyclists to put helmets on). I don’t know of any academic arguing to take helmets off.

12. I read the entire exchange between the Thompsons/Rivara and Adams/Hillman, and the latter never actually answer the question. In their last response they state

“We wish them luck in their systematic review of all the tens of thousands of articles that have a bearing on risk compensation.”

For a recent paper, I tried to make sense of their arguments and found no supportive evidence. There appears to be little to no research on this topic and support for the risk compensation hypothesis appears to be based on belief and not evidence.

“Thompson, Thompson and Rivara (2001) have called for a systematic review of the evidence surrounding bicycle helmets and risk compensation. In their view, the “empirical evidence to support the risk compensation theory is limited if not absent.” In a response, Adams and Hillman (2001) argue such a review would be difficult due to the “tens of thousands of articles that have a bearing on risk compensation”. A search using the phrase “risk compensation” turned up 147 articles on Medline, 322 articles on Scopus and 343 articles on Web of Science (14 August 2014). The number of articles reduced dramatically when the phrase “bicycle helmet” was added to the search – 1 for Medline, 9 for Scopus and 6 for Web of Science. Note that 4 of the 9 Scopus articles are opinion pieces co-authored by Adams or Hillman.”

http://acrs.org.au/journa...

Your analogy of comparing Mendelian genetics with risk compensation is without any support. My point, which I believe was also the Thompsons/Rivara's, is that if risk compensation is so pervasive, why is there such a lack of evidence?

13. I don’t see the problem here. They were from different analyses – one comparing passing distances as a continuous variable at various intervals and the other logistic regression using various cut-points. There is no guarantee the results will be identical from two statistical approaches.

14. Your argument here has no basis. In terms of helmets having a negative effect, the differences in passing distance don’t get above 1.7cm until the passing distance is over 2m. However, passing distances are favorable for wearing helmets between 0 and 0.75m as Walker got an extra 5.2cm of passing distance when wearing a helmet. I haven’t highlighted this before because I believe the results from Walker’s data indicate a null effect of helmet wearing (either positively or negatively). However, I find it curious you haven’t noticed it before considering you think helmets play a role here.

Why must you disparage statisticians? Statisticians clearly have a role in sample size calculations and, in turn, determining effects of practical importance. I know of very few researchers who have a clear understanding of effect sizes in their field and determining an effect size a priori usually happens through a discussion with a statistician. What is your track record? You don’t seem to be aware of such things yet you are quick to make judgments about what happens in a research environment.

15. We have offered many reasons why we believe the results using Walker’s data are not supportive of a helmet effect on overtaking distance. You have just chosen not to believe them.


What I wonder about you, and others I've come across, is what your response would be if Walker had observed greater passing distances when wearing a helmet. That’s not very far-fetched considering the small average difference in passing distances and the large amount of variability (see Fig 1). My guess is you would have attacked Walker back in 2007 just like you’re doing now to me. In that hypothetical situation, I would have been just as critical of his analysis and conclusions as I am now. The only agenda I’m trying to push here is greater scientific understanding of cycling safety. I can’t say the same about you.

No competing interests declared.

Helmet effect size and direction in cases less than 0.75 metres

MKary replied to jakeolivier on 10 Dec 2014 at 17:15 GMT

With the exception of the subject of this post, I am satisfied that the opposing views have by now been expressed in a way that illuminates their respective worths. Indeed several of them have been argued two or more times, the only additional information revealed being something to do with temperament.

I see only one "new" item in Olivier's latest response:

Olivier: "passing distances are favorable for wearing helmets between 0 and 0.75m as Walker got an extra 5.2cm of passing distance when wearing a helmet. I haven’t highlighted this before because I believe the results from Walker’s data indicate a null effect of helmet wearing (either positively or negatively). However, I find it curious you haven’t noticed it before..."

Indeed I did notice this before, not the least when Olivier used it previously against another poster in this forum. I noticed it particularly because as usual, what Olivier doesn't tell us is more important than what he does.

Notice that, rather than consider the appropriate, obvious, and simpler condition "less than 0.75 m", Olivier carefully chooses the more complicated-- and inappropriate for the circumstance-- "between 0 and 0.75m". This is because Olivier is excluding the helmeted cases where Walker was actually struck.

Olivier is excluding them now despite the fact that in the preceding exchange, after I remarked in passing that they should be included in the calculations, Olivier had no problem doing so and recalculating on that basis-- there, he could do so without making things look worse for helmets.

But if these data are included in the circumstance Olivier now brings up-- as they should be-- then the picture Olivier has presented changes: now the effect size is 9.4 cm worse for the helmeted condition.

(This estimate comes from coding their distances as zeros, to be conservative in favour of the helmeted condition. In fact the true distances were slightly negative, and if they were known precisely would make the result still worse for helmets.)

This is the largest effect size of the entire study, and, as Olivier sees it, it is in the most crucial circumstance: 0.75 m or less being already a "risky scenario", as he has described it elsewhere.

But then, Olivier hadn't wanted to highlight this anyway. Perhaps next he will tell us why the result isn't important after all.

To respond in advance in that case: so why did Olivier bring it up in the first place?

No competing interests declared.

RE: Helmet effect size and direction in cases less than 0.75 metres

jakeolivier replied to MKary on 22 Jan 2015 at 05:31 GMT

M Kary

What exactly do you mean by temperament? You have used many provocative terms to insinuate my intentions are not sincere and your comments are very antagonistic. Is it really that difficult to stick to the material?

There was plenty that was new in my previous response. The points all lead to the conclusion that the effect of helmet wearing in Walker's data is a non-issue.

Below is my previous response regarding the two times Walker was hit by a vehicle – once by a bus and once by a heavy goods vehicle, and both while wearing a helmet. That is the only information I know about those two incidents.

“You state that, if included, the “various calculations are correspondingly affected.” This information can be added to the data set and reanalysed. Note the only information we have here is vehicle type, passing distance (set to 0) and helmet use (set to 1), so we cannot construct a similar model to what was published. With passing distance as the dependent variable, the estimated helmet effects are -8.2cm and -8.5 with and without those two observations. When passing distance is categorised by the one metre rule, the odds ratios are 1.24 and 1.30 with and without the additional observations. In each case, the estimated effect moved closer to a null effect. This may seem counterintuitive, but it’s because vehicle type is confounded with helmet wearing for those two observations and vehicle type is a better predictor of overtaking distance.”

There is no way to unconfound the effects of vehicle size and helmet wearing. I contend vehicle size is far more important here because the effect sizes are much larger from the multivariable analyses (0.089 vs -0.058, linear regression; 0.58 vs 1.13, logistic regression) found in Tables 3 and 4 of our paper. And, if you want to establish that helmet wearing is important here, you would first need to remove the effect of known risk factors related to a lack of road space, such as vehicle size.

No competing interests declared.