Abstract
Standardized mean differences (SMDs) are frequently used to appraise the effects of psychological treatments, and to combine them in meta-analyses. Yet, there is no consensus on how exactly SMDs should be computed from randomized trials. In this study, we show that different SMD variants can heavily diverge in aggregate-data meta-analyses, subverting the original purpose of standardization. We investigate the impact this has on the estimated benefits of depression psychotherapies. Different SMD versions using endpoint or change scores were calculated from a comprehensive database of randomized trials, comparing depression psychotherapy against pharmacotherapy and inactive controls. Pooled treatment effects were obtained for each variant, assuming correlations between baseline and endpoint scores of 0.2 through 0.8, and their relationship was examined using bivariate meta-analyses. We also investigated which study characteristics predicted divergent effect estimates. A total of k = 443 trials with 48,221 participants were analyzed. The pooled effect of psychotherapy versus controls varied heavily depending on the calculation methods (SMD = 0.65–1.24), even though the same studies were used. Divergences were less pronounced for psychotherapies compared to pharmacotherapy (SMD = 0.05–0.14). Change score SMDs deviated from endpoint SMDs especially when high (r = 0.8) or low (r = 0.2) pre-post correlations were assumed. This difference was largest in subfields with high treatment effects. Different SMD calculation methods can lead to strongly diverging effect estimates of psychological treatment; especially when change scores are used and pre-post correlations are very high or low. This could have a profound impact on how treatment benefits are interpreted within and across meta-analyses. Researchers could prioritize endpoint SMDs of randomized trials, and should consider standardization using population-level estimates to improve the comparability of meta-analytic effects in the field.
Open Material; Registration: https://doi.org/10.5281/zenodo.10694719; https://osf.io/yx5jg; https://osf.io/4j23t.
Citation: Harrer M, Miguel C, Luo Y, Ostinelli EG, Karyotaki E, Leucht S, et al. (2025) Standardized effect sizes are far from “Standardized”: A primer and empirical illustration in depression psychotherapy meta-analyses. PLOS Ment Health 2(7): e0000347. https://doi.org/10.1371/journal.pmen.0000347
Editor: Gareth Hagger-Johnson, UCL: University College London, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: December 12, 2024; Accepted: May 15, 2025; Published: July 1, 2025
Copyright: © 2025 Harrer et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All code used for the analyses is openly available on Zenodo (https://doi.org/10.5281/zenodo.10694719).
Funding: There was no funding for the present work. Outside the present work, EGO was supported by the National Institute for Health and Care Research (NIHR) Research Professorship (grant RP-2017-08-ST2-006), by the National Institute for Health Research (NIHR) Applied Research Collaboration Oxford and Thames Valley (ARC OxTV) at Oxford Health NHS Foundation Trust, by the NIHR Oxford Health Clinical Research Facility, by the NIHR Oxford Health Biomedical Research Centre (grant BRC-1215-20005), and by the Brasenose College Senior Hulme scholarship. The views expressed are those of the authors and not necessarily those of the UK National Health Service, the NIHR, or the Department of Health and Social Care.
Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: MH is a part-time employee of Get.On Institut GmbH/HelloBetter, a company that implements digital mental health interventions into routine care. EGO received research and consultancy fees from Angelini Pharma. In the last three years, SL has received honoraria for advising/consulting and/or for lectures and/or for educational material from Angelini, Apsen, Boehringer Ingelheim, Eisai, Ekademia, Gedeon Richter, Janssen, Karuna, Kynexis, Lundbeck, Medichem, Medscape, Mitsubishi, Neurotorium, Otsuka, NovoNordisk, Recordati, Rovi, Teva. All other authors have declared that no competing interests exist.
Introduction
In mental health research, rating scales are widely used to assess patients’ symptom severity. Measurements are typically obtained by creating a sum score of all evaluated items, a practice that is not uncontroversial [1]. Such scales are also common in randomized controlled trials (RCTs), where the instrument is administered in both the intervention and control group at one or several assessment points. Randomization ensures exchangeability of the average potential outcome values, meaning that the causal treatment effect is identifiable by comparing the expected sum score of both groups at the same point in time [2].
There are different methods to calculate this estimand in practice, typically resulting in an estimated mean difference (MD) between the two groups. This value can then be used to assess the size of the treatment effect. However, for most mental disorders (including depression [3]), several symptom inventories are available, and if trials employed different scales, their mean differences cannot be directly compared. This has led to a widespread adoption of “standardized” effect sizes, most notably the standardized mean difference (SMD; Cohen’s d). To standardize the effect, the mean difference is divided by a standardizing denominator, usually the pooled standard deviation (SD) of the sample. This step is essential when trials using different rating scales are combined in meta-analysis, and it is widely employed in the social sciences [4]. Even when the same scale was used, SMDs may sometimes be favored over unstandardized MDs. For instance, health professionals are more likely to interpret SMDs correctly [5,6], and SMDs may also offer somewhat greater generalizability compared to MDs (defined as lower cross-study variability not attributable to sampling error, and greater agreement in effect estimates across studies [7]).
While commonly used, this standardization is not without flaws. First, it introduces ambiguity concerning which pooled SD should be selected (i.e., baseline, change, or endpoint scores). Moreover, it creates additional variability because SDs are estimated from individual trials, which often have limited sample sizes, and may differ in how narrowly defined their patient population was [8,9]. Previous work indicates that SMD calculation methods in the literature are very heterogeneous, including in psychiatric research; and that effect estimates can vary dramatically within the same study [10].
When used in meta-analyses, SMDs can have a substantial impact on an entire research field, as pooled effects are often used to inform treatment guidelines. This warrants a closer look at whether different ways to calculate a “standardized” effect will indeed yield consistent results. To this end, we first review approaches to calculate treatment effects from an RCT, including methods often used in aggregate-data meta-analyses. Then, we derive how and when these methods predict differences in “standardized” expressions of the effect.
SMD calculation methods: A primer
For RCTs with pre-test measure $Y_{\text{pre}}$ and post-test measure $Y_{\text{post}}$, a frequently used approach to estimate the average treatment effect $\tau$ is an analysis of covariance (ANCOVA). ANCOVA implies a linear model in which $Y_{\text{post}}$ is regressed on a treatment indicator $T$ and (one or multiple) baseline measures $X$. Here, we assume that $X = Y_{\text{pre}}$ is the pre-test measure of the (continuous) outcome, and the only variable to be controlled for in the model. This gives us the following conditional expectation for $Y_{\text{post}}$ [2,11]:

$$\mathbb{E}[Y_{\text{post}} \mid T, Y_{\text{pre}}] = \beta_0 + \tau T + \beta\, Y_{\text{pre}} \tag{1}$$

In (1) above, $T$ represents the control or intervention group, with $T \in \{0, 1\}$, and with $\tau$ being the average treatment effect; while $\beta$ quantifies the slope between the pre- and post-test scores. ANCOVA-type models are frequently recommended in the analysis of RCTs, since adjustment for the baseline scores controls for between-group differences in prognostically relevant variables (i.e., realized confounding [12,13]) and improves power [14–16].
Covariate adjustment typically requires that individual participant data (IPD) is available for the trial. However, in meta-analyses of aggregate data, only the group means or group-wise mean change from baseline may be reported instead. This means that meta-analysts may be forced to obtain effect estimates from change or endpoint scores only, without any further adjustment. Both approaches can be re-expressed as special cases of the ANCOVA model given in (1).
First, we assume that the difference in change scores between the two groups is used to determine the treatment effect. Following Laird [17], we can rearrange (1) so that:

$$\mathbb{E}[Y_{\text{post}} - Y_{\text{pre}} \mid T, Y_{\text{pre}}] = \beta_0 + \tau T + (\beta - 1)\, Y_{\text{pre}} \tag{2}$$

This equation offers two insights. Firstly, it emphasizes that using change scores as the outcome will yield identical results to a standard ANCOVA if baseline scores are additionally controlled for. Secondly, it shows that a “crude” analysis of change scores without baseline adjustment will only be identical to ANCOVA when the slope between pre-test and post-test scores is exactly one (since $Y_{\text{pre}}$ will only drop out of the equation when $\beta = 1$). Sometimes, it is assumed that using change scores will control for baseline symptom severity; (2) above shows that this only holds when $\beta$ is exactly one, which is unlikely to occur in practice [11].

If an unadjusted analysis of the endpoint scores is used, (1) reduces to:

$$\mathbb{E}[Y_{\text{post}} \mid T] = \beta_0 + \tau T \tag{3}$$

Effectively, this approach ignores the relationship between pre- and post-test scores, thus setting $\beta = 0$. In RCTs, this approach remains asymptotically unbiased but is generally less efficient than ANCOVA with baseline score adjustment [13]. Notably, mean differences derived from both change scores (equation 2) and endpoint scores (equation 3) provide unbiased estimates of the treatment effect in successfully randomized trials, though their efficiency may vary [11].
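To make the distinction between the three analytic strategies concrete, the following is a minimal R sketch on simulated data; the values and variable names are illustrative, and the snippet is not taken from the analysis code of this study.

```r
# Minimal simulation: one two-arm trial with pre- and post-test depression scores
set.seed(42)
n     <- 200                                  # participants per arm
group <- rep(c(0, 1), each = n)               # 0 = control, 1 = intervention
pre   <- rnorm(2 * n, mean = 20, sd = 5)      # baseline severity
post  <- 10 + 0.5 * pre - 3 * group + rnorm(2 * n, sd = 4)  # true effect: -3 points

# (1) ANCOVA: endpoint regressed on treatment and baseline
coef(lm(post ~ group + pre))["group"]

# (2) Change scores: re-adding the baseline as a covariate reproduces the ANCOVA
#     estimate exactly; a "crude" change-score analysis does not (unless beta = 1)
coef(lm(I(post - pre) ~ group + pre))["group"]
coef(lm(I(post - pre) ~ group))["group"]

# (3) Unadjusted analysis of endpoint scores (beta implicitly set to 0)
coef(lm(post ~ group))["group"]
```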
A further complication in meta-analyses is that different instruments (e.g., depression scales) may have been used across studies, and that mean differences estimated from each trial are therefore not comparable. This can be resolved by calculating a “unit-free” [18] measure of the effect, viz., the SMD. A generic definition of this standardized effect $\delta$ is [19]:

$$\delta = \frac{\mu_1 - \mu_2}{\sigma} \tag{4}$$

where $\mu_1$ and $\mu_2$ are the (independent) population-level means of two populations, and $\sigma$ is the SD based on either population (where $\sigma_1 = \sigma_2 = \sigma$). A practical problem is what empirical estimates ought to be plugged into equation (4). In an analysis of endpoint scores, the SMD is typically calculated using this formula [20]:

$$\text{SMD}_{\text{EP/EP}} = \frac{\bar{y}_{1,\text{EP}} - \bar{y}_{2,\text{EP}}}{s_{\text{EP}}} \times J(n_1 + n_2 - 2) \tag{5}$$

where the second part of the formula applies a small-sample bias correction, with function $J(\cdot)$ defined as:

$$J(x) = \frac{\Gamma(x/2)}{\sqrt{x/2}\;\Gamma\!\left(\tfrac{x-1}{2}\right)} \tag{6}$$

where $\Gamma(\cdot)$ is the gamma function and $x$ the degrees of freedom. This small-sample bias corrected SMD is commonly known as Hedges’ g. The standardizing denominator $\sigma$ in (4) is estimated by the pooled SD of the endpoint scores $s_{\text{EP}}$ in both groups:

$$s_{\text{EP}} = \sqrt{\frac{(n_1 - 1)\, s_{1,\text{EP}}^2 + (n_2 - 1)\, s_{2,\text{EP}}^2}{n_1 + n_2 - 2}} \tag{7}$$
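As an illustration of equations (5)–(7), a small R helper could compute the bias-corrected endpoint SMD (Hedges’ g) from aggregate data. This is a sketch with invented inputs; in practice, the same computation is available via escalc(measure = "SMD") in the metafor package.

```r
# Hedges' g from endpoint means, SDs, and sample sizes (equations 5-7)
hedges_g_ep <- function(m1, sd1, n1, m2, sd2, n2) {
  s_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  df <- n1 + n2 - 2
  J  <- gamma(df / 2) / (sqrt(df / 2) * gamma((df - 1) / 2))  # small-sample correction
  ((m1 - m2) / s_pooled) * J
}

# Illustrative trial: lower endpoint scores in group 1 yield a negative SMD
hedges_g_ep(m1 = 12, sd1 = 8, n1 = 60, m2 = 17, sd2 = 9, n2 = 60)
```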
The SMD based on change scores can be calculated in a similar manner:

$$\text{SMD}_{\text{CS}} = \frac{(\bar{y}_{1,\text{EP}} - \bar{y}_{1,\text{BL}}) - (\bar{y}_{2,\text{EP}} - \bar{y}_{2,\text{BL}})}{s} \times J(n_1 + n_2 - 2) \tag{8}$$

However, since $\text{SMD}_{\text{CS}}$ makes use of both $Y_{\text{pre}}$ and $Y_{\text{post}}$, the standardizing denominator $s$ is less clearly defined. Some propose that the pooled pre-test SD should be used [21,22]; while others define $s$ as the SD of the change scores [11,23]:

$$s_{g,\text{CS}} = \sqrt{s_{g,\text{BL}}^2 + s_{g,\text{EP}}^2 - 2\, r_g\, s_{g,\text{BL}}\, s_{g,\text{EP}}} \tag{9}$$

where $g$ is the intervention or control group, and $r_g$ is the in-sample correlation coefficient between the pre- and post-test scores. A practical problem with (9) above is that such trial-specific correlation coefficients are rarely reported; the value of $\text{SMD}_{\text{CS}}$ obtained using this method will therefore heavily depend on the value imputed for $r$. In meta-analyses, $r$ values can be imputed using representative values from the literature, or sourced from other studies in the meta-analysis that provide empirical estimates. In some cases, $r$ may also be approximated from other reported summary statistics [24].
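The following R sketch implements the change-score SMD variants described above from aggregate data; the function name, arguments, and example values are illustrative only, and the imputed correlation r must be supplied by the analyst.

```r
# Change-score SMDs with different standardizing denominators (equations 8-9)
smd_cs <- function(m_bl1, m_ep1, sd_bl1, sd_ep1, n1,
                   m_bl2, m_ep2, sd_bl2, sd_ep2, n2,
                   r, denominator = c("BL", "CS", "EP")) {
  denominator <- match.arg(denominator)
  md_cs <- (m_ep1 - m_bl1) - (m_ep2 - m_bl2)   # difference in mean change
  pool  <- function(s1, s2) sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  sd_cs1 <- sqrt(sd_bl1^2 + sd_ep1^2 - 2 * r * sd_bl1 * sd_ep1)  # equation (9)
  sd_cs2 <- sqrt(sd_bl2^2 + sd_ep2^2 - 2 * r * sd_bl2 * sd_ep2)
  s <- switch(denominator,
              BL = pool(sd_bl1, sd_bl2),
              CS = pool(sd_cs1, sd_cs2),
              EP = pool(sd_ep1, sd_ep2))
  df <- n1 + n2 - 2
  J  <- gamma(df / 2) / (sqrt(df / 2) * gamma((df - 1) / 2))
  (md_cs / s) * J    # negative values indicate greater symptom reduction in group 1
}

# Same aggregate data as above, assuming r = 0.4 and the change-score SD
smd_cs(20, 12, 6, 8, 60, 20, 17, 6, 9, 60, r = 0.4, denominator = "CS")
```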
Senn [11,25] shows that, under some simplifying assumptions, the different analytical strategies (ANCOVA, analysis of endpoint scores, analysis of change scores) are strictly related. Given equal variances of the baseline and endpoint scores ($\sigma_{\text{BL}} = \sigma_{\text{EP}} = \sigma$), as well as equal pre-post correlations $\rho$ and sample sizes in both groups, we obtain the following equality for effect estimates of the three approaches:

$$\hat{\tau}_{\text{ANCOVA}} = \rho\, \hat{\tau}_{\text{CS}} + (1 - \rho)\, \hat{\tau}_{\text{EP}} \tag{10}$$

This shows that effect estimates based on change scores ($\hat{\tau}_{\text{CS}}$) will be closer to the ANCOVA estimate when the pre-post correlation is high (i.e., $\rho > 0.5$). If the correlation is lower ($\rho < 0.5$), the unadjusted endpoint estimate ($\hat{\tau}_{\text{EP}}$) will be closer.
Under these assumptions, we can also directly define the relationship between the SD estimates that are used to “standardize” the mean difference in the denominator:

$$s_{\text{CS}} = \sigma\sqrt{2(1 - \rho)}, \qquad s_{\text{EP}} = \sigma, \qquad s_{\text{ANCOVA}} = \sigma\sqrt{1 - \rho^2} \tag{11}$$

This equation again underlines the importance of the pre-post correlation: for large correlations (i.e., $\rho > 0.5$), $s_{\text{EP}}$ will be larger than $s_{\text{CS}}$; thus, given the same estimated mean difference, $\text{SMD}_{\text{CS}}$ (standardized by $s_{\text{CS}}$) will be larger than $\text{SMD}_{\text{EP}}$. This reverses for smaller correlations ($\rho < 0.5$): $s_{\text{EP}}$ is now smaller than $s_{\text{CS}}$, so that $\text{SMD}_{\text{CS}} < \text{SMD}_{\text{EP}}$. Importantly, (11) above also shows that, in almost all contexts, the SD of the ANCOVA model will be smaller than the one based on change scores or endpoints only (since $\sqrt{1 - \rho^2} \le 1$ and $\sqrt{1 - \rho^2} \le \sqrt{2(1 - \rho)}$ for $0 \le \rho \le 1$). This relationship also translates to the sampling variances $V_{\text{ANCOVA}}$, $V_{\text{CS}}$, and $V_{\text{EP}}$ [11]. In sum, this shows that different SMD calculation methods can produce widely varying results, even when the true treatment effect is the same. These discrepancies primarily stem from the choice of standardizing denominator. In S1 Text, we provide a visual summary of these predicted differences as obtained by a simulation study.
The same definitions as shown above are also given in an influential treatment by Cohen [26]. However, Cohen discusses these different ways to obtain the standardizing denominator in the context of power analyses. In this setting, it is clearly sensible to adapt the standardizing divisor to the analytic approach to be used in the study: given the same sample size and mean difference, adjusting for baseline will yield higher power estimates because it decreases the standardizing denominator, thus yielding a higher SMD to be considered in the power analysis. Adjusting for covariates in an ANCOVA will almost always increase the statistical power compared to a crude analysis of endpoint means; for change scores, this will only be the case if the pre-post correlation is high.
It is questionable if this rationale translates well into the context of meta-analyses. Different ways to obtain the standardizing denominator mean that effect estimates will diverge depending on the method that was used to calculate the SMD (change score or endpoints), and the pre-post correlation that meta-analysts are willing to assume. Depending on what approaches are used, this may heavily limit the comparability of effect sizes within and across meta-analyses. Researchers could seriously over- or underestimate the efficacy of a treatment if SMD estimates are compared to other trials or meta-analyses using a different standardization method, or if wrong assumptions about the pre-post correlation are made. Results of our “toy” simulation shown in S1 Text further illustrate this issue.
Cohen himself remarked on the limited transportability of standardized effect sizes [27,28]. SMDs and similar measures create a dependency between the effect size and the variability of a specific sample; this means that two identical patients with the same causal treatment benefits (e.g., a 5-point decrease on the PHQ-9 compared to not receiving treatment) would be judged to have experienced drastically different treatment effects if one was part of group that varies greatly, and the other part of a group with hardly any variation. This issue also extends to the different standardizing denominators we mentioned above: given the same aggregate data, the size of a treatment effect entered into meta-analysis will depend on the method used to obtain the SMD, and how efficient this approach is in the specific context of the study. Such a context-dependent divergence of identical causal treatment effects is clearly undesirable.
Aims of the current study
It is important to note that the strict relationships between different calculation methods described in (10) and (11) are based on several simplifying assumptions (homogeneity of variances, $\sigma_{\text{BL}} = \sigma_{\text{EP}}$; equal correlations and sample sizes in both groups). These assumptions are unlikely to hold exactly in practice. More generally, it is uncertain what the real impact of these divergences will be in fields such as meta-analytic psychotherapy research, where SMDs are commonly used; and which types of studies and treatments are most affected. A previous meta-epidemiological study indicates that SMD estimates can vary strongly within the same trial, especially in studies with small sample sizes and high treatment effects; but no approach appeared to produce consistently smaller or higher values [10]. In this study, we therefore aim to systematically investigate the impact of different SMD calculation methods on the estimated meta-analytic effect of psychotherapy for depression. Focusing on aggregate-data information reported in the publications, we will examine different approaches to obtain the mean difference between groups (endpoint scores versus change scores), as well as the standardizing denominator (pooled pre-test, change score, or endpoint SD), and examine how strongly these SMD variants can diverge from each other. We will also assess the influence of pre-post correlations (ranging from low to high) that meta-analysts may be willing to assume when calculating the SMDs.
Method
A preregistration of our investigation has been published with the Open Science Framework (osf.io/4j23t). All code used for the analyses is openly available on Zenodo (doi.org/10.5281/zenodo.10694719).
Datasets
Our analyses are based on two meta-analytic databases included in the “Metapsy” meta-analytic research domain (MARD) for psychological treatments [29,30] (metapsy.org). The Metapsy MARD provides comprehensive living databases of randomized trials for various indications and treatments, which are harmonized using a unified protocol [31]. The Metapsy databases have been used for more than 100 meta-analytic reviews published within the last 15 years [30,32] (see metapsy.org/published-articles for an overview). The exact search strategy, data extraction and coding for each database is detailed in the documentation page of the initiative (docs.metapsy.org/databases).
This study focuses on the “Depression: Psychotherapy vs. Control” (docs.metapsy.org/ databases/depression-psyctr) and “Depression: Psychotherapy vs. Pharmacotherapy” datasets, which are compiled using the same methodology [33]. The most recent update of the databases was used, including studies published until May 1st, 2023. Both database versions can be downloaded online (doi.org/10.5281/zenodo.15584092; “data” folder). In both databases, risk of bias is rated using four criteria of the “Risk of bias” (RoB) assessment tool, version 1, developed by Cochrane [34]. Assessed domains include the adequate generation of allocation sequence; the concealment of allocation to conditions; the prevention of knowledge of the allocated intervention (masking of assessors); and dealing with incomplete outcome data (this was assessed as positive when intention-to-treat analyses were conducted, meaning that all randomized patients were included in the analyses). Trials are judged as having a low risk of bias when they score positive on all four domains. Psychological treatments are categorized into one of eight types based on a pre-specified rationale [35]. Extractions also include group-wise attrition, defined as the number of participants who were lost to follow-up.
For both datasets, results of all depression symptom instruments are extracted from each trial. A pre-specified hierarchy is used when extracting the effect size data, giving priority to the raw mean, SD and sample size of each condition at baseline and follow-up.
Because our analysis focused on comparing different SMD calculation methods with each other, we only considered studies for which the arm-specific mean, SD and sample size at baseline and endpoint were available. We excluded studies that reported the mean change scores and their standard deviation directly, but not the means, SDs or sample sizes of scores at baseline and the endpoint, because some SMD variants cannot be directly calculated from them. In the “Depression: Psychotherapy vs. Pharmacotherapy” dataset, we additionally excluded trial arms that did not employ psychotherapy or antidepressant medication (ADM) as a monotherapy (e.g., combined therapy, psychotherapy and pill placebo, ADM and pill placebo).
Calculation of change scores
Arm-specific mean change scores (mCS) in the eligible trials were calculated by subtracting the mean depressive symptoms score at baseline from the endpoint mean (mEP – mBL). We assumed that sample sizes for the change score (nCS) were identical to the sample size available at the endpoint (nEP). We also calculated the SD of the change scores (SDCS), using the formula given in (9), and assuming different pre-post correlations (see below).
Calculation of SMDs
In all included studies, we first calculated the SMD using endpoint means, which were standardized by the pooled SD of the endpoint scores (SMDEP/EP). Then, we also calculated three SMD variants based on change scores, dividing by either the (i) pooled pre-test SD (SMDCS/BL), (ii) pooled change score SD (SMDCS/CS), or (iii) pooled post-test SD (SMDCS/EP). SMDEP/EP, SMDCS/BL, SMDCS/CS and SMDCS/EP represent distinct estimators, differing in their numerator (endpoint vs. change scores) and/or denominator (baseline, endpoint, or change score SD), which may lead to substantial variations in the numeric value of the resulting effect size. All SMD versions were adjusted for small-sample bias using the correction factor described in equations (5) and (6).
The sampling variance V of SMDEP/EP was calculated via the unbiased estimator given by Viechtbauer [28]:

$$V_{\text{EP/EP}} = \frac{1}{n_1} + \frac{1}{n_2} + \left(1 - \frac{df - 2}{df \cdot J(df)^2}\right) g^2$$

with $df = n_1 + n_2 - 2$ and $g$ the bias-corrected SMD. A delta method approximation was used for the sampling variances of SMDCS/CS and SMDCS/EP, and the formula derived by Morris [21] was used for SMDCS/BL.

Calculation of SMD versions based on change scores requires the value of the pre-post correlation to be imputed. In this analysis, we considered a range of possible correlation values (r = 0.2, 0.4, 0.6, and 0.8) [36], leading to four versions of each SMDCS variant being calculated for each comparison.
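The resulting combinations can be enumerated with a trivial R sketch, re-using the illustrative smd_cs() helper defined above and assuming the four correlation values and three change-score denominators described in this section.

```r
# One change-score SMD per combination of denominator and imputed correlation
grid <- expand.grid(denominator = c("BL", "CS", "EP"),
                    r           = c(0.2, 0.4, 0.6, 0.8),
                    stringsAsFactors = FALSE)

grid$smd <- mapply(function(d, r)
  smd_cs(20, 12, 6, 8, 60, 20, 17, 6, 9, 60, r = r, denominator = d),
  grid$denominator, grid$r)

grid   # 12 rows: the same trial data yield 12 different change-score SMD values
```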
Meta-analysis
For each of the SMD variants, we calculated the pooled effect of psychotherapy versus control groups, and of psychotherapy versus ADM, on depressive symptom severity. Different pooling models were considered: (i) a three-level “correlated and hierarchical effects” (CHE) model, assuming a constant sampling correlation of ρ = 0.6 for effect sizes clustered within studies [37]; (ii) a generic inverse-variance random-effects pooling model, for which multiple effect sizes within studies were pre-aggregated to avoid double-counting (again assuming ρ = 0.6); (iii) the same model as in (ii), but only using the highest or lowest effect size within a study; and (iv) the same model as in (ii), but excluding outliers and influential cases determined using the “leave-one-out” diagnostics by Viechtbauer and Cheung [38] (employing the same “rules of thumb” for outlier identification as used in the influence.rma function in metafor [39]). The restricted maximum likelihood (REML [40]) estimator was used to calculate the heterogeneity variance (components) τ². The Knapp-Hartung adjustment was applied to the pooled effect in models (ii) to (iv) [41]. For model (i), cluster-robust variance estimation (CR2 estimator [42]) was used instead.
To examine the relationship between SMDEP/EP and the different SMDCS variants, we re-used the pre-aggregated effect estimates obtained for models (ii) to (iv) to perform a bivariate meta-analysis [43]. This model allows each trial to contribute two effect estimates, SMDEP/EP and one SMDCS variant, the true values of which are likely to be correlated. An unstructured heterogeneity variance-covariance matrix was used in the model, which allows the covariance between true effect sizes based on SMDEP/EP and SMDCS to be estimated across studies. We then used the results to regress the estimated true effects based on SMDEP/EP on the ones using the SMDCS variant. Ideally, SMDEP and SMDCS should not diverge, meaning that the estimated slope in this model should be exactly one. Thus, we also tested if the slope deviated significantly from this value. Bivariate models were fitted for each combination of SMDEP/EP and the SMDCS variants, and for all correlation values assumed in the effect size calculation step (i.e., r = 0.2, 0.4, 0.6, and 0.8). To facilitate computations, while modelling the correlation of SMDEP/EP and SMDCS across trials, variances of the two SMD estimates were treated as conditionally independent within the same trial.
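One way the bivariate model could be set up in metafor is sketched below; the long-format data frame `dat_long` and its columns are hypothetical, and extracting the slope from the estimated variance-covariance matrix is one possible summary under the assumptions stated in the text, not necessarily the exact computation used by the authors.

```r
library(metafor)

# `dat_long` (hypothetical): one row per trial and method, with columns yi, vi,
# study, and method ("EP" = endpoint SMD, "CS" = one change-score SMD variant)
dat_long$method <- factor(dat_long$method, levels = c("EP", "CS"))

biv <- rma.mv(yi, vi,
              mods   = ~ method - 1,        # separate pooled effect per method
              random = ~ method | study,    # correlated true effects within trials
              struct = "UN",                # unstructured variance-covariance matrix
              data   = dat_long, method = "REML")

# Slope relating true change-score effects to true endpoint effects, derived from
# the estimated between-study variance-covariance matrix G (order = factor levels)
G     <- biv$G
slope <- G[2, 1] / G[1, 1]   # cov(EP, CS) / var(EP); a value of 1 means no divergence
```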
In a last step, we extended the bivariate models to examine if effect size divergences are moderated by study-level covariates. Examined moderators were (i) attrition (pooled across both groups); (ii) baseline imbalance (defined as the absolute value of the between-group SMD at baseline); (iii) the number of domains assessed to have a low risk of bias (range: 0–4); (iv) the type of control group; and (v) the type of psychological treatment used in the trial. This analysis was restricted to the “Depression: Psychotherapy vs. Control” database, for which a substantially larger number of studies was available.
All analyses were conducted in R version 4.2.0, using the metapsyTools package [44]. This extension imports functionality from the meta [45], metafor [39], dmetar [46] and clubSandwich [47] packages.
Results
A total of k = 532 trials were available in the two databases (psychotherapy versus control: k = 466; psychotherapy versus ADM: k = 66). After removing studies without reported pre- and post-test means, SDs or sample sizes, k = 443 (83.3%) RCTs could be included in the analysis (psychotherapy versus control: k = 395, 84.8%; psychotherapy versus ADM: k = 48, 72.7%). In total, these trials enrolled 48,221 patients (psychotherapy versus control: 40,871; versus ADM: 7,350) and reported 902 effect measures (psychotherapy versus control: 791; versus ADM: 111). References for all included studies are provided in S2 Text. Overall, 156 (35.2%) trials met all four criteria for low risk of bias (psychotherapy versus control: 145, 36.7%; psychotherapy versus ADM: 11, 22.9%).
Pooled effects using the different SMD calculation methods are provided in Table 1. This table only shows results for the three-level CHE model; S1 Table and S2 Table give the results for all pooling models. Compared to the endpoint SMD estimate (SMDEP/EP = 0.78), effects of psychotherapy versus control groups were considerably higher when change score SMDs using baseline SDs were employed (SMDCS/BL = 0.92 to 0.93). Only a small difference emerged when change score SMDs were standardized by the post-test SD (SMDCS/EP = 0.82 to 0.83). For change score SMDs standardized by the change score SD, divergences heavily depended on the assumed correlation. We found much higher effect estimates for r = 0.8 (SMDCS/CS = 1.24, vs. SMDEP/EP = 0.78) and r = 0.6 (SMDCS/CS = 0.92); but comparable effects when r = 0.4 (SMDCS/CS = 0.76), and slightly lower results for r = 0.2 (SMDCS/CS = 0.65). The estimated total heterogeneity variance mirrored this pattern, leading to higher values (τ² = 1.221, vs. 0.487 for SMDEP/EP) when assuming r = 0.8, and to lower values (τ² = 0.259) for r = 0.2. Overall, the proportion of variation not attributable to sampling error was very large for all meta-analyses (I² = 76.4% to 97.9%), even when outliers and influential cases were removed (I² = 60.2% to 94.3%; see S1 Table).
Relationships between endpoint and change score SMDs as estimated using bivariate meta-analysis are visualized in Fig 1. For SMDCS/BL, estimated slopes ranged from β = 1.160 to 1.197, and differed significantly from one (all p < 0.05). For SMDCS/CS, the estimated slopes were β = 0.868 (r = 0.2), 0.973 (r = 0.4), 1.150 (r = 0.6), and 1.544 (r = 0.8), all diverging significantly from unity. For SMDCS/EP, the estimated slopes ranged from β = 1.027 to 1.063 (all p < 0.05).
Pooled effects of psychotherapy versus ADM were more comparable across calculation methods (SMD = 0.05 to 0.14). However, for SMDCS/CS, estimates of the total heterogeneity again heavily depended on the chosen correlation (τ² = 0.181 to 0.922), with higher imputed correlations producing higher heterogeneity estimates. The percentage of variation not attributable to sampling error was also large (I² = 69.3% to 97.7%), and remained substantial even after outlier and influential case removal (I² = 37.3% to 91.9%; see S2 Table). Effect divergences for this dataset are provided in the bottom row of Fig 1. Despite the lower overall effect, the estimated slopes again displayed a similar pattern. For SMDCS/BL, all slopes were significantly larger than one (β = 1.171 to 1.173, all p < 0.05). Divergences of SMDCS/CS again varied heavily depending on the chosen correlation (β = 0.870 to 1.643); for r = 0.4, true effects correlated almost perfectly (β = 0.997). For SMDCS/EP, slopes were slightly larger than one (β = 1.093 to 1.094; all p < 0.05).
S1 Fig shows differences between SMD calculation methods conditional on study and treatment characteristics. Results were consistent with the overall pattern established via bivariate meta-analysis. First, divergences were greatest for SMDCS/CS when high correlation values (i.e., r = 0.8) were chosen (ΔSMD = 0.35 to 0.52). Results for SMDCS/BL were less pronounced, but still led to significant effect differences for all comparators, and for some treatments (behavioral activation, cognitive behavior therapy, interpersonal therapy, problem solving therapy; ΔSMD = 0.12 to 0.28). Across moderators, differences between SMDCS/EP and SMDEP/EP were mostly small and not significant. Second, we found that effect divergences were generally higher among subgroups of studies that tend to produce high effect estimates. For SMDCS/BL (r = 0.2 to 0.6) and SMDCS/CS (r = 0.6 and 0.8), we found a significant moderating effect of study quality, whereby distances to SMDEP/EP increase with the number of domains judged to have a high or unclear risk of bias. Baseline imbalance had no impact for SMDCS/BL and SMDCS/CS, but predicted higher divergences when using SMDCS/EP. We did not find that study attrition (i.e., the proportion of patients lost to follow-up) had a significant influence on effect size differences, except when assuming r = 0.8 for SMDCS/CS. Comprehensive results are tabulated in S3 Table.
Discussion
In this study, we examined the impact of different SMD calculation methods on the meta-analytic effects of depression psychotherapy. We examined different ways to compute the unstandardized mean difference (endpoint scores versus change from baseline), and different standardizing denominators. For psychotherapy compared to inactive control groups, results differed substantially depending on the calculation method, with SMDs ranging from 0.65 to 1.24. Such differences can have a major impact on the clinical interpretation of treatment effects: following Cohen’s “operational definition” [48], the lowest estimate would indicate a medium-sized effect of psychotherapy, while the highest estimate represents a very large effect; even though the same data was used. These variations present a considerable risk if meta-analytic results are naïvely compared to other reviews or trials using a different SMD calculation method. Our findings illustrate, as others have before [10,28], that SMDs can be much less “standardized” than their name suggests.
Our results align with statistical theory, which predicts divergences between SMD estimates depending on the denominator used to “standardize” the effect (see, e.g., results of our simulation in S1 Text, which closely mirror our empirical findings). Holding the raw mean difference constant, SMDs will increase as the SD in the denominator decreases, and this effect will be most pronounced when the raw mean difference is large. This will often be the case in trials comparing effective treatments against “weak” comparators (e.g., waiting list, placebo, or other inactive controls [49]). This may also explain why divergences were considerably smaller for comparisons of psychotherapy to ADM (SMD = 0.05 to 0.14), where only minor between-group differences are typically found. However, even in this dataset, SMD variants could lead to markedly dissimilar estimates of the heterogeneity variance. Standardization could also explain our observed difference between endpoint and change score SMDs when the latter are divided by the pre-test SD. Compared to post-test, pre-test SDs in RCTs may often be restricted due to the application of cut-offs or floor effects, which leads to higher SMDs on average. Consistent with this, smaller divergences were found when change score SMDs were calculated using the post-test SD instead. When change score SDs were used in the denominator, we found a very strong dependence on the pre-post correlation; pooled effects compared to inactive controls differed by ΔSMD = 0.58 depending on whether high (r = 0.8) or low (r = 0.2) values were assumed.
This is problematic for aggregate-data meta-analyses. In-sample correlations will seldom be reported for every study, and therefore must be imputed using a sensible “guesstimate” from the literature. A previous review reported a median pre-post correlation of r = 0.36 for psychiatric interventions (25th percentile: 0.22; 75th percentile: 0.58), the lowest among all fields of medicine [36]. This value is close to r = 0.4, for which divergences between SMDEP/EP and SMDCS/CS in our analyses were smallest, at least when change score SDs were used. In meta-analyses where the correlation can be obtained from each study (or individual participant data is available), differences between SMDEP/EP and SMDCS/CS might therefore often be less pronounced. A recent investigation using IPD meta-analysis confirmed this [50], and further research may be helpful to corroborate this finding. Major discrepancies may still be possible if pre-post correlations vary considerably across trials, or if there are subfields with persistently higher or lower within-group correlations.
Table 2 presents practical recommendations for calculating SMDs using aggregate data from mental health trials. Overall, we believe our results seriously question the usefulness of change score SMDs – at least in meta-analytic psychotherapy research. Contrary to common belief, this calculation method does not adequately control for patients’ baseline symptomatology in most cases; yet it creates considerable ambiguity as to what plug-in estimator should be used in the standardizing denominator. Some have proposed that the pre-test SD should be used for this purpose [21,22], but our findings indicate that SMDCS/BL can lead to substantially larger effect estimates than SMDEP/EP. A recent simulation study suggested that change score SMDs may be less biased for studies with attrition at follow-up [22], but we did not find that this had a significant impact on the relationship between change score and endpoint SMDs; neither did the strength of baseline imbalance within studies. If change score SDs are used instead, the (pooled) SMD will strongly depend on how efficient change scores are as estimators of the true treatment effect compared to endpoint scores, and this is largely determined by the (true or “guesstimated”) pre-post correlation we happen to find in a specific trial (cf. equation 11). Arguably, none of these are desirable properties for a “standardized” effect that should facilitate comparing results across trials, treatments, or research fields.
There is less ambiguity concerning the calculation of endpoint SMDs (viz., SMDEP/EP). Also, should r = 0.36 hold as a generally representative value for psychiatric contexts, endpoint mean differences might come closer to ANCOVA-based estimates (cf. equation 10). Prioritizing endpoint SMDs in meta-analyses should also be practically feasible; for example, post-treatment means and SDs could be extracted from 88.4% (inactive controls) and 74.8% (ADM) of all trials included in the depression psychotherapy databases we analyzed here. In trials that only report the change from baseline, mean differences could also be standardized by the endpoint SD, since this led to only minor differences in our analysis. If not reported, there is empirical support for borrowing endpoint SDs from the other studies [51].
In our analysis, 16.7% to 27.3% of trials had to be removed because they did not report group means and standard deviations at both pre- and post-test. This indicates that, in general, outcome reporting in psychotherapy trials needs to be improved. To enhance transparency, researchers may also provide a pre-specification of the SMD calculation method they plan to employ, to prevent selective reporting of the one variant yielding the highest effect size. In this context, it is important to re-emphasize that none of the SMD calculation methods we examined here is inherently “wrong” or biased. Researchers may still select a different variant than SMDEP/EP for their analysis; but this should be clearly described, since it could limit the comparability of the effect size. We also want to underline that our recommendations in this paper are purely pragmatic, and may not translate to every context in mental health research. Examples include quasi-experimental designs, or sub-fields in which pre-post correlations are typically reported.
Finally, one should not gloss over the fact that even endpoint SMDs use a “plug-in” estimate of the population SD, which will depend on the overall variability in the trial. SDs may still differ between tightly controlled studies and, say, pragmatic trials with broad inclusion criteria. It has been recommended that, instead of computing study-specific estimates, meta-analysts should employ external SD estimates, with the same reference value used for each scale ([52]; so a common SD for, e.g., the Patient Health Questionnaire, Hamilton Depression Rating Scale, Beck Depression Inventory, etc.). This could further improve the comparability and transportability of effect estimates beyond a single trial.
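As a minimal sketch of this idea, trial-level mean differences could be divided by a common, externally derived reference SD per instrument; the reference values and trial data below are invented placeholders for illustration, not recommendations derived from [52].

```r
# Scale-specific external reference SDs (illustrative placeholder values only)
ref_sd <- c(PHQ9 = 6, HDRS = 8, BDI = 10)

trials <- data.frame(instrument = c("PHQ9", "BDI", "HDRS"),
                     md         = c(-3.1, -5.4, -4.2))   # raw mean differences

# Same denominator for every trial that used the same instrument
trials$smd_external <- trials$md / ref_sd[trials$instrument]
trials
```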
Naturally, the optimal solution would be if all measurements in RCTs were standardized to begin with. There are increasing efforts to establish core outcome sets (COS [53]) to be included in all clinical studies within a research field, including psychological treatment [54–56]. Most “perils of standardization” we examined in this paper could be avoided altogether if such standards were more widely adopted.
Supporting information
S1 Fig. Estimated relationship between true effect sizes for different calculation methods of the SMD.
ΔSMD = Difference between the pooled effect based on endpoint SMDs (SMDEP/EP), and SMDs calculated using change scores (SMDCS). “Attrition” refers to the proportion of participants who were lost to follow-up, pooled across both trial arms (continuous covariate); “Baseline Imbalance” to the absolute value of the between-group SMD at baseline (continuous covariate); and “Risk of Bias” to the number of domains assessed to have a low risk of bias (continuous covariate; 0–4).
https://doi.org/10.1371/journal.pmen.0000347.s001
(TIFF)
S1 Text. True effect and estimated SMD conditional on calculation methods (Simulated example).
No legend.
https://doi.org/10.1371/journal.pmen.0000347.s002
(PDF)
S2 Text. References of the included studies.
No legend.
https://doi.org/10.1371/journal.pmen.0000347.s003
(PDF)
S1 Table. Pooled effects of psychotherapy versus control groups, based on different calculation methods of the SMD.
SMDEP/EP = SMD calculated by dividing the mean endpoint difference by the pooled endpoint SD; SMDCS/BL = SMD calculated by dividing the mean change score difference by the pooled baseline SD; SMDCS/CS = SMD calculated by dividing the mean change score difference by the pooled change score SD; SMDCS/EP = SMD calculated by dividing the mean change score difference by the pooled endpoint SD.
https://doi.org/10.1371/journal.pmen.0000347.s004
(PDF)
S2 Table. Pooled effects of psychotherapy versus pharmacotherapy trials, based on different calculation methods of the SMD.
SMDEP/EP = SMD calculated by dividing the mean endpoint difference by the pooled endpoint SD; SMDCS/BL = SMD calculated by dividing the mean change score difference by the pooled baseline SD; SMDCS/CS = SMD calculated by dividing the mean change score difference by the pooled change score SD; SMDCS/EP = SMD calculated by dividing the mean change score difference by the pooled endpoint SD.
https://doi.org/10.1371/journal.pmen.0000347.s005
(PDF)
S3 Table. Divergent effect estimates, conditional on study and treatment characteristics (Psychotherapy versus Control).
ΔSMD = Difference between the pooled effect based on endpoint SMDs (SMDEP/EP), and SMDs calculated using change scores (SMDCS). “Attrition” refers to the proportion of participants who were lost to follow-up, pooled across both trial arms (continuous covariate); “Baseline Imbalance” to the absolute value of the between-group SMD at baseline (continuous covariate); and “Risk of Bias” to the number of domains assessed to have a low risk of bias (continuous covariate; 0–4).
https://doi.org/10.1371/journal.pmen.0000347.s006
(PDF)
References
- 1. Fried EI, Nesse RM, Zivin K, Guille C, Sen S. Depression is more than the sum score of its parts: individual DSM symptoms have different risk factors. Psychol Med. 2014;44(10):2067–76. pmid:24289852
- 2. Harrer M, Cuijpers P, Schuurmans LKJ, Kaiser T, Buntrock C, van Straten A, et al. Evaluation of randomized controlled trials: a primer and tutorial for mental health researchers. Trials. 2023;24(1):562. pmid:37649083
- 3. Fried EI, Flake JK, Robinaugh DJ. Revisiting the theoretical and methodological foundations of depression measurement. Nat Rev Psychol. 2022;1(6):358–68. pmid:38107751
- 4. White IR, Schmid CH, Stijnen T. Choice of effect measure and issues in extracting outcome data. 1st ed. Boca Raton, FL and London: Chapman & Hall/CRC Press; 2021.
- 5. Heimke F, Furukawa Y, Siafis S, Johnston B, Engel R, Furukawa TA, et al. Understanding effect size – an international online survey among psychiatrists, psychologists, physicians from other medical specialities, dentists, and other health professionals. BMJ Ment Health. 2024.
- 6. Johnston BC, Alonso-Coello P, Friedrich JO, Mustafa RA, Tikkinen KAO, Neumann I, et al. Do clinicians understand the size of treatment effects? A randomized survey across 8 countries. CMAJ. 2016;188(1):25–32. pmid:26504102
- 7. Takeshima N, Sozu T, Tajika A, Ogawa Y, Hayasaka Y, Furukawa TA. Which is more generalizable, powerful and interpretable in meta-analyses, mean difference or standardized mean difference? BMC Med Res Methodol. 2014;14(1):1–7.
- 8. Dias S, Welton NJ, Sutton AJ, Ades A. NICE DSU technical support document 2: a generalised linear modelling framework for pairwise and network meta-analysis of randomised controlled trials. National Institute for Health and Clinical Excellence; 2011.
- 9. Furukawa TA, Leucht S. Can we inflate effect size and thus increase chances of producing “positive” results if we raise the baseline threshold in schizophrenia trials?. Schizophr Res. 2013;144(1–3):105–8. pmid:23312551
- 10. Luo Y, Funada S, Yoshida K, Noma H, Sahker E, Furukawa TA. Large variation existed in standardized mean difference estimates using different calculation methods in clinical trials. J Clin Epidemiol. 2022;149:89–97. pmid:35654267
- 11. McKenzie JE, Herbison GP, Deeks JJ. Impact of analysing continuous outcomes using final values, change scores and analysis of covariance on the performance of meta-analytic methods: a simulation study. Res Synth Methods Wiley Online Library. 2016;7(4):371–86. pmid:26715122
- 12. Vander Weele TJ. Confounding and effect modification: distribution and measure. Epidemiol Methods. 2012;1(1):55–82. pmid:25473593
- 13. Johansson P, Nordin M. Inference in experiments conditional on observed imbalances in covariates. In: The American Statistician. Taylor & Francis; 2022. p. 1–11.
- 14. Egbewale BE, Lewis M, Sim J. Bias, precision and statistical power of analysis of covariance in the analysis of randomized trials with baseline imbalance: a simulation study. BMC Med Res Methodol. 2014;14:49. pmid:24712304
- 15. Clifton L, Clifton DA. The correlation between baseline score and post-intervention score, and its implications for statistical analysis. Trials. 2019;20(1):43. pmid:30635021
- 16. U.S. Food and Drug Administration. Adjusting for covariates in randomized clinical trials for drugs and biological products. 2023. https://www.regulations.gov/docket/FDA-2019-D-0934
- 17. Laird N. Further comparative analyses of pretest-posttest research designs. The American Statistician Taylor & Francis. 1983;37(4a):329–30.
- 18. Cohen J. Effect size (Chap 11: Some issues in power analysis). In: Statistical power analysis for the behavioral sciences. 2nd ed. Lawrence Erlbaum Associates; 1983.
- 19. Hedges LV, Olkin I. Estimation of effect size from a single experiment. In: Statistical methods for meta-analysis. 1st ed. Academic Press; 1985.
- 20. Hedges LV. Distribution theory for Glass’s estimator of effect size and related estimators. J Edu Stat. 1981;6(2):107–28.
- 21. Morris SB. Estimating effect sizes from pretest-posttest-control group designs. Organizational Res Methods SAGE Publications Inc. 2007;11(2):364–86.
- 22. Gnambs T, Schroeders U. Accuracy and precision of fixed and random effects in meta-analyses of randomized control trials for continuous outcomes. Res Synth Methods. 2024;15(1):86–106. pmid:37751893
- 23. Higgins J, Thomas J. Imputing standard deviations for changes from baseline. In: Cochrane handbook for systematic reviews of interventions. 2023. https://training.cochrane.org/handbook/current/chapter-06#section-6-5-2-8
- 24. Jané MB, Harlow T, Khu C, Shah S, Gould T, Veiner E, et al. Extracting pre-post correlations for meta-analyses of repeated measures designs. 2024. https://archive.fo/tkHG1
- 25. Senn S. Baseline distribution and conditional size. J Biopharm Stat. 1993;3(2):265–76. pmid:8220409
- 26. Cohen J. Qualifying dependent variables. In: Statistical power analysis for the behavioral sciences. 2nd ed. Lawrence Erlbaum Associates; 1983.
- 27. Cohen J. The earth is round (p < .05). Am Psychol. 1994;49(12):997.
- 28. Viechtbauer W. Approximate confidence intervals for standardized effect sizes in the two-independent and two-dependent samples design. J Edu Behav Stat. 2007;32(1):39–60.
- 29. Cuijpers P, Miguel C, Papola D, Harrer M, Karyotaki E. From living systematic reviews to meta-analytical research domains. Evid-Based Mental Health Royal College of Psychiatrists; 2022;25(4):145–7.
- 30. Cuijpers P, Miguel C, Harrer M, Plessen CY, Ciharova M, Papola D, et al. Psychological treatment of depression: a systematic overview of a ‘meta-analytic research domain’. J Affect Disord. 2023.
- 31. Harrer M, Miguel C, van Ballegooijen W, Ciharova M, Plessen CY, Kuper P. Supersizing meta-analysis of psychological interventions: features and findings of the “Metapsy” meta-analytic research domain. 2024.
- 32. Cuijpers P. Four decades of outcome research on psychotherapies for adult depression: an overview of a series of meta-analyses. Canadian Psychol / Psychologie Canadienne US: Educational Publishing Foundation. 2017;58(1):7–19.
- 33. Cuijpers P, Karyotaki E. A meta-analytic database of randomised trials on psychotherapies for depression. 2020.
- 34. Higgins JP, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ British Med J Publish Group. 2011;343.
- 35. Cuijpers P, Karyotaki E, de Wit L, Ebert DD. The effects of fifteen evidence-supported therapies for adult depression: a meta-analytic review. Psychother Res. 2020;30(3):279–93. pmid:31394976
- 36. Balk EM, Earley A, Patel K, Trikalinos TA, Dahabreh IJ. Empirical assessment of within-arm correlation imputation in trials of continuous outcomes. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012.
- 37. Pustejovsky JE, Tipton E. Meta-analysis with robust variance estimation: expanding the range of working models. Prev Sci. 2022;23(3):425–38. pmid:33961175
- 38. Viechtbauer W, Cheung MW-L. Outlier and influence diagnostics for meta-analysis. Res Synth Methods. 2010;1(2):112–25. pmid:26061377
- 39. Viechtbauer W. Conducting meta-analyses in R with the metafor package. J Stat Soft. 2010;36(3).
- 40. Viechtbauer W. Bias and efficiency of meta-analytic variance estimators in the random-effects model. J Educ Behav Stat. 2005;30(3):261–93.
- 41. Knapp G, Hartung J. Improved tests for a random effects meta-regression with a single covariate. Stat Med Wiley Online Library. 2003;22(17):2693–710. pmid:12939780
- 42. Pustejovsky JE, Tipton E. Small-sample methods for cluster-robust variance estimation and hypothesis testing in fixed effects models. J Business Econo Stat. 2018;36(4):672–83.
- 43. van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Stat Med Wiley Online Library. 2002;21(4):589–624. pmid:11836738
- 44. Harrer M, Sprenger AA, Kuper P, Karyotaki E, Cuijpers P. metapsyData: access the meta-analytic psychotherapy databases in R. 2022. https://data.metapsy.org
- 45. Balduzzi S, Rücker G, Schwarzer G. How to perform a meta-analysis with R: a practical tutorial. Evid Based Ment Health. 2019;22(4):153–60. pmid:31563865
- 46. Harrer M, Cuijpers P, Furukawa T, Ebert DD. dmetar: companion R package for the guide ‘Doing meta-analysis in R’. 2019.
- 47. Pustejovsky J. clubSandwich: cluster-robust (Sandwich) variance estimators with small-sample corrections. 2022. https://CRAN.R-project.org/package=clubSandwich
- 48. Cohen J. “Small”, “medium” and “large” d values. In: Statistical power analysis for the behavioral sciences. 2nd ed. Lawrence Erlbaum Associates; 1983.
- 49. Michopoulos I, Furukawa TA, Noma H, Kishimoto S, Onishi A, Ostinelli EG, et al. Different control conditions can produce different effect estimates in psychotherapy trials for depression. J Clin Epidemiol Elsevier. 2021;132:59–70. pmid:33338564
- 50. Ostinelli EG, Efthimiou O, Luo Y, Miguel C, Karyotaki E, Cuijpers P, et al. Combining endpoint and change data did not affect the summary standardised mean difference in pairwise and network meta-analyses: an empirical study in depression. 2024.
- 51. Furukawa TA, Barbui C, Cipriani A, Brambilla P, Watanabe N. Imputing missing standard deviations in meta-analyses can provide accurate results. J Clin Epidemiol. 2006;59(1):7–10. pmid:16360555
- 52. Gallardo-Gómez D, Pedder H, Welton NJ, Dwan K, Dias S. Variability in meta-analysis estimates of continuous outcomes using different standardization and scale-specific re-expression methods. J Clin Epidemiol. 2024;165:111213. pmid:37949198
- 53. Williamson PR, Altman DG, Blazeby JM, Clarke M, Devane D, Gargon E, et al. Developing core outcome sets for clinical trials: issues to consider. Trials. 2012;13(1):1–8.
- 54. Chevance A, Ravaud P, Tomlinson A, Le Berre C, Teufer B, Touboul S, et al. Identifying outcomes for depression that matter to patients, informal caregivers, and health-care professionals: qualitative content analysis of a large international online survey. Lancet Psychiatry. 2020;7(8):692–702. pmid:32711710
- 55. Prevolnik Rupel V, Jagger B, Fialho LS, Chadderton L-M, Gintner T, Arntz A, et al. Standard set of patient-reported outcomes for personality disorder. Qual Life Res. 2021;30(12):3485–500. pmid:34075531
- 56. Krause KR, Chung S, Adewuya AO, Albano AM, Babins-Wagner R, Birkinshaw L, et al. International consensus on a standard set of outcome measures for child and youth anxiety, depression, obsessive-compulsive disorder, and post-traumatic stress disorder. Lancet Psychiatry Elsevier. 2021;8(1):76–86. pmid:33341172