RESULTS ABOUT ANTIDEPRESSANTS REMAIN ROBUST: AUTHOR REPLIES TO STATISTICAL COMMENTARIES
Posted by plosmedicine on 31 Mar 2009 at 00:25 GMT
Author: Tania Huedo-Medina
Institution: University of Connecticut, Psychology
Additional Authors: Professor Blair Johnson, Professor Irving Kirsch
Submitted Date: May 01, 2008
Published Date: May 1, 2008
This comment was originally posted as a “Reader Response” on the publication date indicated above. All Reader Responses are now available as comments.
In the time since we published our article about the efficacy of new generation antidepressants relative to placebo (1), some have raised doubts about our statistical methods, despite our earlier reply to commentaries (2). In this reply, we address these concerns and re-assert our original conclusion that the efficacy of antidepressants relative to placebo does not reach clinical significance criteria except in trials that sampled patients with extremely severe depression.
1. Effect size definitions and related analyses. A central concern in the new commentaries was one of our main analyses, which evaluated change for drug and placebo groups without taking a direct difference between them. Thus, effect sizes were calculated separately for each group for this analysis, though the analysis combined them. Leonard regarded this practice as “unorthodox” and Wohlfarth regarded it as “erroneous because the effect size in an RCT is defined as the difference between the effect of active compound and placebo.” First, these concerns ignore the fact that our article’s between-group analyses confirmed the major trends present in the analyses that considered within-group change. Specifically, both sets of analyses concluded that antidepressants’ efficacy was greater at higher initial severity, attaining clinical significance standards only for samples with extremely severe initial depression. Second, although the commentators may be correct that our within-group analyses are relatively innovative in this literature, that does not mean that they were wrong. To the contrary, these statistics are in conventional usage elsewhere (e.g., 3, 4, 5), as Waldman’s commentary implies.
Wohlfarth is correct that it is key to compare the change between the two groups, but it is also possible to calculate other effect sizes that map degree of change, and these can support generalizations that the between-group comparison cannot. In particular, without the within-group analyses it would not have been possible to conclude that placebo responses were lessening as initial severity of depression increased (whereas drug response remained constant; see our article’s Figures 2 and 3). This unique contribution of our article contradicts Wohlfarth’s conclusion that it contained “nothing new.” (Further, this tendency was present even when the one sample of moderate depression patients was excluded, contrary to Wohlfarth’s statement that this tendency depended on the trial with moderate severity.) Moreover, without this analysis, it also would not have been possible to estimate the extent to which placebo effect size matches the drug effect size. The presence of multiple trials (rather than a single trial) also helps the analyses’ statistical power to “estimate the pure effect of treatment and of placebo within a single trial of this sort” (Young). Finally, the analyses did incorporate a direct contrast between drug and placebo (see Table 2, and Model 2c, for example).
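To make the within-group effect size concrete, here is a minimal sketch of a standardised mean change for a single trial arm, in the spirit of the statistics discussed in (3) and (5). It is one common large-sample formulation and not necessarily the exact computation used in our article. The function name, the pre-post correlation r, and all numbers are hypothetical and serve only as illustration.

```python
def standardized_mean_change(mean_pre, mean_post, sd_pre, n, r):
    """Standardised mean change for one trial arm, with its sampling variance.

    r is the pre-post correlation. Trial reports rarely give it, so it is
    an assumed input here. A positive result means scores fell (improvement).
    """
    if n < 3:
        raise ValueError("need at least 3 patients")
    # Small-sample bias correction with n - 1 degrees of freedom.
    correction = 1.0 - 3.0 / (4.0 * (n - 1) - 1.0)
    d = correction * (mean_pre - mean_post) / sd_pre
    # Large-sample approximation to the sampling variance of d.
    variance = 2.0 * (1.0 - r) / n + d * d / (2.0 * n)
    return d, variance


# Hypothetical arm summaries (not data from the article).
drug = standardized_mean_change(25.0, 15.0, 8.0, n=100, r=0.5)
placebo = standardized_mean_change(25.0, 17.0, 8.0, n=100, r=0.5)
print("drug arm:", drug)
print("placebo arm:", placebo)
```

Because each arm receives its own effect size, drug and placebo change can be examined separately as well as contrasted.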
2. Assumptions in estimates of precision for each effect size. It is well known that the standardised mean difference follows a non-central t-distribution (6), which is approximately normal when sample sizes are large enough (in the case of our meta-analysis, 5,133 patients would seem large enough). Waldman argued that our estimates of the overall difference between drug and placebo were conservatively biased (i.e., too small) because of assumptions present in our estimates of precision for each effect size. It is of course not possible to be certain that one has completely removed error from any measurement, or for that matter, to do so in an analysis of measures from independent trials. As Young noted, there are uncontrolled measurement errors or artefacts that necessitate the use of a control group and the randomised controlled trial design. Previous research about the placebo effect has documented that placebo conditions can be considered an intervention in and of themselves (e.g., 7). Because there are no pure control groups (e.g., wait-list controls), the amount of change that would be evident in a group with no intervention at all is not clear (8). All of these factors should be carefully weighed in interpretations.
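As an illustration of the large-sample normal approximation mentioned above, the following sketch computes a bias-corrected between-group standardised mean difference and a normal-theory 95% confidence interval, along the lines of (6). The function name and the input numbers are hypothetical, and the variance formula is the usual large-sample approximation rather than an exact non-central t calculation.

```python
import math


def hedges_g(mean_drug, mean_placebo, sd_pooled, n_drug, n_placebo):
    """Bias-corrected standardised mean difference with a 95% normal CI."""
    df = n_drug + n_placebo - 2
    # Small-sample correction factor.
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    g = correction * (mean_drug - mean_placebo) / sd_pooled
    # Large-sample approximation to the sampling variance of g.
    total = n_drug + n_placebo
    variance = total / (n_drug * n_placebo) + g * g / (2.0 * total)
    half_width = 1.96 * math.sqrt(variance)
    return g, variance, (g - half_width, g + half_width)


# Hypothetical mean improvements and pooled SD (not data from the article).
g, variance, interval = hedges_g(10.0, 8.0, 8.0, n_drug=150, n_placebo=150)
print("g:", g, "variance:", variance, "95% CI:", interval)
```

With arms of this size the correction factor is close to 1, which is why the normal approximation is usually considered adequate for large trials.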
Although alternative weighting strategies may yield somewhat different results, the choices converge well both for the overall mean difference and for analyses of the trends across the literature. As an example, Leonard (04 March 2008) reported replicating our meta-regression patterns using alternative precision weights.
Importantly, as our article documented (Figures 2 & 3), the size of the difference between drug and placebo grows as the samples’ initial severity increases to extremely severe depression (but is very small at lower observed levels of initial severity). Because the overall differences between drug and placebo depended on initial severity, it is misleading to consider the overall difference in isolation.
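The dependence of the drug-placebo difference on initial severity can be examined with a weighted regression of trial-level differences on baseline severity. The sketch below is a bare weighted least-squares fit and is not the model reported in our article. The function name, the severity values, the differences, and the weights are all made up for illustration.

```python
def weighted_linear_fit(x, y, weights):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar)
               for w, xi, yi in zip(weights, x, y))
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical trials: baseline severity, drug-placebo difference, weight.
baseline = [23.0, 25.0, 27.0, 29.0]
difference = [0.5, 1.5, 2.5, 3.5]
weights = [40.0, 25.0, 30.0, 20.0]

intercept, slope = weighted_linear_fit(baseline, difference, weights)
print("intercept:", intercept, "slope:", slope)
```

A positive slope means the difference grows with initial severity, so a single overall mean would conceal that trend.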
3. Choice of effect size metric. As we alluded to above, drug trial results may be captured and analyzed in different yet converging ways. Our use of the standardised mean difference to examine within-group change permitted us to control for the differences in the underlying variances of the samples included in the trials, which varied widely across the trials. In this fashion, the amount of change in evidence was judged against the amount of variation exhibited within the drug and placebo arms of each particular trial. The calculation of a weighted effect size by using the inverse of each within-subjects variance is more precise than a sample-size weighted average (9), contrary to Waldman’s assertion.
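The two weighting schemes contrasted above can be compared directly. The sketch below pools the same effect sizes once with inverse-variance weights and once with sample-size weights. The helper name and all effect sizes, variances, and sample sizes are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical trial-level effect sizes, variances, and sample sizes.
effects = [0.2, 0.5, 0.8]
variances = [0.01, 0.04, 0.09]
sample_sizes = [200, 80, 50]

inverse_variance_weights = [1.0 / v for v in variances]
iv_mean = weighted_mean(effects, inverse_variance_weights)
n_mean = weighted_mean(effects, sample_sizes)
# Variance of the inverse-variance pooled estimate (fixed-effect model).
iv_variance = 1.0 / sum(inverse_variance_weights)

print("inverse-variance mean:", iv_mean, "variance:", iv_variance)
print("sample-size mean:", n_mean)
```

The two pooled estimates differ whenever a trial's variance is not simply proportional to the reciprocal of its sample size, as is the case for these made-up values.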
4. Clinical relevance. Contrary to popular wisdom, it is not just response rates or odds ratios that provide information about clinical relevance or significance. Indeed, response rates can create an illusion of clinical effectiveness (10) because they dichotomize the level of depression (depressed vs. non-depressed), so differences in response rates do not indicate differences in the number of people who have improved between the two conditions. Instead, response rates indicate differences in the number of individuals whose degree of improvement has pushed them over a specified criterion. Patients who are at baseline classified as depressed may show substantial and clinically significant improvement following a therapy, and these are the patients who are most likely to become “non-depressed” when given medication. They achieve this status when the 1.80-point (on average) difference in HRSD improvement reported in our article pushes them over the “non-depressed” criterion. Thus, the group level results reflect individual change, because individual change underlies the observed group changes, contrary to Wohlfarth’s assertion that clinical relevance “can only be validly applied to change in an individual patient.” Were Wohlfarth’s assertion true, then the entire logic underlying drug trials (viz., drug vs. placebo group comparisons) would be nullified. To the contrary, confidence in scientific conclusions grows as more individuals are studied.
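A small calculation shows how dichotomizing can magnify a modest mean difference. The sketch assumes that improvement scores are normally distributed with the same spread in both arms, and that the drug arm is shifted by the 1.80-point average difference noted above. The placebo mean, the standard deviation, and the response criterion are hypothetical values chosen only for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def response_rate(mean_improvement, sd, criterion):
    """Share of patients whose improvement meets or exceeds the criterion,
    assuming normally distributed improvement scores."""
    return 1.0 - normal_cdf((criterion - mean_improvement) / sd)


# Hypothetical values, except the 1.80-point mean difference.
placebo_mean = 8.0
drug_mean = placebo_mean + 1.80
sd = 8.0
criterion = 12.0

drug_rate = response_rate(drug_mean, sd, criterion)
placebo_rate = response_rate(placebo_mean, sd, criterion)
print("drug response rate:", drug_rate)
print("placebo response rate:", placebo_rate)
print("difference:", drug_rate - placebo_rate)
```

Under these assumptions every patient in the drug arm gains the same small increment, yet the gap in response rates reflects only those patients whom the increment carries past the criterion.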
Clinical relevance can also be defined as the difference between the level of depression exhibited by an initially depressed sample compared to the normal population. After the depressed sample receives therapy, clinical significance may be defined as occurring when this sample’s levels of depression fall within the range of the normal population (where range is defined as within two standard deviations of the mean of that population; see (11)). Thus a comparison of our results against the scores on the scale for a normal population can give us an idea about the clinical significance of the treatment. Unfortunately, as we noted in our prior commentary (2), this comparison suggests that most patients in trials of antidepressants still exhibit much higher than normal levels of depression.
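The normal-range criterion described above reduces to a simple cutoff check, in the manner of (11). In the sketch below the function names, the normative mean and standard deviation, and the post-treatment mean are hypothetical placeholders, not values from our article or from any normative study.

```python
def normal_range_cutoff(normal_mean, normal_sd, n_sd=2.0):
    """Upper bound of the normal range: mean plus n_sd standard deviations."""
    return normal_mean + n_sd * normal_sd


def within_normal_range(sample_mean, normal_mean, normal_sd, n_sd=2.0):
    """True if the treated sample's mean score is at or below the cutoff."""
    return sample_mean <= normal_range_cutoff(normal_mean, normal_sd, n_sd)


# Hypothetical normative values and post-treatment mean.
cutoff = normal_range_cutoff(normal_mean=4.0, normal_sd=3.0)
treated_ok = within_normal_range(15.0, normal_mean=4.0, normal_sd=3.0)
print("cutoff:", cutoff, "within normal range:", treated_ok)
```

Only the upper bound matters here, because higher scores on a depression scale indicate more severe depression.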
5. Assumptions of scaling. Tennant drew attention to the fact that some might regard the Hamilton rating scale of depression, which was central in our analyses, as an ordinal scale whereas the analytic techniques assume at least interval level of measurement (12, 13). Although some meta-analytic methods have been designed to analyse ordinal data (14, 15), these are appropriate when the scales have discretely different categories. In contrast, the Hamilton scale offers degrees of depression ranging from none to extremely severe and extensive quantitative research supports interval interpretations (e.g., 13). Because our meta-analysis had the complete Hamilton information for the relevant trials, it made sense to make use of it rather than to treat the data as though they are relatively coarse categories. Moreover, sample means achieve greater measurement precision with higher statistical power than individual cases and our analyses centred on the means rather than individual-level responses, similar to other meta-analytic practice (e.g., 16, 17).
In summary, we conclude that the analyses we reported are based on conventional assumptions and that the conclusions we reached about our results are robust. The efficacy of antidepressants relative to placebo does not reach clinical significance criteria except in trials that sampled patients with extremely severe depression.
Tania B. Huedo-Medina, Department of Psychology, University of Connecticut (Storrs, CT), USA
Blair T. Johnson, Department of Psychology, University of Connecticut (Storrs, CT), USA
Irving Kirsch, Department of Psychology, University of Hull, UK
(1) Kirsch I, Deacon BJ, Huedo-Medina TB, Scoboria A, Moore TJ, Johnson BT (2008) Initial severity and antidepressant benefits: A meta-analysis of data submitted to the Food and Drug Administration. PLoS Medicine 5, 260-268. doi:10.1371/journal.pmed.0050045
(2) Johnson BT, Huedo-Medina TB, Kirsch I, Deacon BJ (2008, 14 March) Initial severity and antidepressant benefits: Author replies to commentaries. http://medicine.plosjourn....
(3) Becker BJ (1988) Synthesizing standardized mean-change measures. British Journal of Mathematical and Statistical Psychology, 41, 257-278.
(4) Gibbons RD, Hedeker DR, Davis JM (1993) Estimation of ES from a series of experiments involving paired comparisons. Journal of Educational Statistics, 18, 271-279.
(5) Morris SB, DeShon RP (2002) Combining ES estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7, 105-125.
(6) Hedges LV (1981) Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107-128.
(7) Price DD, Finniss DG, Benedetti F (2008) A comprehensive review of the placebo effect: Recent advances and current thought. Annual Review of Psychology, 59, 565-590.
(8) Kirsch I (2000) Are drug and placebo effects in depression additive? Biological Psychiatry, 47, 733-735.
(9) Sánchez-Meca J, Marín-Martínez F (1998) Weighting by inverse variance or by sample size in meta-analysis: a simulation study. Educational and Psychological Measurement, 58, 211-222.
(10) Kirsch I, Moncrieff J (2007) Clinical trials and the response rate illusion. Contemporary Clinical Trials, 28, 348-351.
(11) Jacobson NS, Truax P (1991) Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12-19.
(12) Steinmeyer EM, Moller HJ (1992) Facet Theoretic analysis of the Hamilton-D scale. Journal of Affective Disorders, 25, 53-62.
(13) Uher R, Farmer A, Maier W, Rietschel M, Hauser J, Marusic A, Mors O, Elkin A, Williamson RJ, Schmael C, Henigsberg N, Perez J, Mendlewicz J, Janzing JGE, Zobel A, Skibinska M, Kozel D, Stamp AS, Bajs M, Placentino A, Barreto M, McGuffin P, Aitchison KJ (2008) Measuring depression: Comparison and integration of three scales in the GENDEP study. Psychological Medicine, 38, 289-300.
(14) Whitehead A, Whitehead J (1991) A general parametric approach to the meta-analysis of randomised clinical trials. Statistics in Medicine, 10, 1665-1677.
(15) Whitehead A, Jones NMB (1994) A meta-analysis of clinical trials involving different classifications of response into ordered categories. Statistics in Medicine, 13, 2503-2515.
(16) Gotzsche PC (2001) Reporting of outcomes in arthritis trials measured on ordinal and interval scales is inadequate in relation to meta-analysis. Annals of the Rheumatic Diseases, 60, 349-352.
(17) Song F, Jerosch-Herold C, Holland R, Drachler ML, Harvey I (2006) Statistical methods for analysing Barthel scores in trials of poststroke interventions: A review and computer simulation. Clinical Rehabilitation, 20, 347-356.