
Meta-regression to explain shrinkage and heterogeneity in large-scale replication projects

  • Rachel Heyard ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    rachel.heyard@uzh.ch

    Affiliation Center for Reproducible Science, Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland

  • Leonhard Held

    Roles Conceptualization, Methodology, Supervision, Writing – review & editing

    Affiliation Center for Reproducible Science, Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland

Abstract

Recent large-scale replication projects (RPs) have estimated concerningly low reproducibility rates. Further, they reported substantial degrees of shrinkage of effect size, where the replication effect size was found to be, on average, much smaller than the original effect size. Within these RPs, the included original-replication study-pairs can vary with respect to aspects of study design, outcome measures, and descriptive features of both original and replication study population and study team. This often results in between-study-pair heterogeneity, i.e., variation in effect size differences across study-pairs that goes beyond expected statistical variation. When broader claims about the reproducibility of an entire field are based on such heterogeneous data, it becomes imperative to conduct a rigorous analysis of the amount and sources of shrinkage and heterogeneity within and between included study-pairs. Methodology from the meta-analysis literature provides an approach for quantifying the heterogeneity present in RPs with an additive or multiplicative parameter. Meta-regression methodology further allows for an investigation into the sources of shrinkage and heterogeneity. We propose the use of location-scale meta-regressions as a means to directly relate the identified characteristics with shrinkage (represented by the location) and heterogeneity (represented by the scale). This provides valuable insights into drivers and factors associated with high or low reproducibility rates and therefore contextualises results of RPs. The proposed methodology is illustrated using publicly available data from the Replication Project Psychology and the Replication Project Experimental Economics. All analysis scripts and data are available online.

Introduction

In the last decade, numerous large-scale replication projects (RPs) were conducted to assess the reproducibility in, among others, the research fields of Psychology [1], Experimental Economics [2], Social Sciences [3], Experimental Philosophy [4] and Cancer Biology [5]. These projects selected a set of highly cited or influential papers in well-established journals and attempted direct replications of the main results from the original studies. A setup where all selected original studies are replicated once is referred to as “Many Phenomena, One Study” [6]. A direct replication is defined as a study “following the methods presented in the original research as close as possible to retrieve new data and achieve consistent results using the same statistical analysis” [7]. The Replication Project Psychology [RPP, 1] concluded that while 97% of the 100 original studies reported a significant result (with p-value < 0.05), only 36% of the replication studies found a significant result with an effect estimate in the same direction. 39% of the effects were subjectively rated by the replication teams to have replicated the original result. The Replication Project Experimental Economics [RPEE, 2] was a much smaller effort compared to the RPP, with only 18 original-replication study-pairs. Here, all included original findings were statistically significant, but only 61% of the replications found a significant effect in the same direction as the original study. The replication effect sizes were found to be significantly smaller in absolute value than the original effect sizes. This phenomenon is commonly known as shrinkage and has been observed in all large-scale replication projects mentioned above [8, 9]. Shrinkage in large-scale RPs has been attributed to a tendency of the original effect estimates being inflated due to publication bias or questionable research practices [10]. Further, between-study-pair heterogeneity, i.e., variation in effect size differences across study-pairs that goes beyond expected statistical variation, is often observed in RPs but generally ignored when the results are summarized. This has led some to criticize the methodology employed by RPs, as they tend to draw negative conclusions with respect to the reproducibility of the investigated research fields and their practices [6, 11]. Shrinkage and heterogeneity are directly linked to reproducibility rates and, while a certain degree of heterogeneity and shrinkage might not be avoidable, identifying covariates associated with high or low levels can put low reproducibility rates into context. In the following, we begin by discussing how heterogeneity can be assessed and how methods from meta-analysis have been applied in the analysis of RPs. We then introduce location-scale meta-regression as a tool to relate specific covariates to both shrinkage and heterogeneity in order to draw more nuanced interpretations of the results from RPs.


Assessment of heterogeneity in the meta-analysis literature.

The concept of heterogeneity between studies is well understood in the meta-analysis literature [12–14]. Most published meta-analyses test for between-study heterogeneity and, if needed, account for it using, for example, a random-effects meta-analysis [15]. Heterogeneity can be quantified using either an additive or a multiplicative model. The additive model or random-effects meta-analysis, as implemented in the metafor R-package [16], is commonly employed in the meta-analysis literature and accounts for heterogeneity by adding a constant $\tau^2$ to each study’s variance. The multiplicative version, on the other hand, relies on a weighted linear regression model with weights equal to the inverse of the studies’ variances. The effect size returned by this multiplicative model is the same as the one from a fixed-effect meta-analysis with variances multiplied by a constant $\phi$, the multiplicative heterogeneity parameter [17]. The additive and multiplicative model mainly differ in their assumptions on the underlying effect sizes. The multiplicative model shares the assumption of a single overall effect size with the traditional fixed-effect meta-analysis. The additive model allows, like the random-effects meta-analysis, for some variability in the true effect sizes [18].

Once heterogeneity has been established and quantified using either model version, meta-regression can be used to investigate whether covariates account for part of the heterogeneity in effect sizes. As explained in Chapter 7 of Schmid et al. (2020) [19], a meta-regression is conceptually the same as a traditional (weighted) regression, but instead of individual subjects, the units of analysis are studies. Such meta-regression models allow researchers to directly add continuous and/or categorical covariates into a model to investigate the covariates’ association with the effect size [12]. The remaining variability that is not accounted for by the included covariates is the residual heterogeneity, $\tau^2$ and $\phi$ respectively. Standard meta-regressions, both additive and multiplicative versions, assume that the amount of residual heterogeneity remains constant across studies. To relax this assumption and allow heterogeneity to depend on covariates, and consequently to be specific to the studies themselves, Viechtbauer and López-López (2022) [20] suggest using location-scale meta-regression, which further allows researchers to investigate which covariates are associated with the amount of heterogeneity. Model selection to reduce the risk of overfitting can be readily applied to location-scale meta-regression [21].

On the use of meta-analysis methods for the analysis of large-scale RPs.

Methodology from the meta-analysis literature has been used to quantify the reproducibility of findings or assess successful replication. For this, results from the original and replication studies are pooled together using a fixed-effect meta-analysis. However, significance of the combined effect estimate as a “replication success” metric has limitations, as it will almost certainly flag success if the original study result is very convincing (small p-value), even if the replication p-value is large [22]. Since there is no universally agreed-upon criterion for replication success, most large-scale replication projects used a whole set of metrics [23]. The RPP, for example, used significance and p-values, effect sizes, subjective assessment of the replication teams, and meta-analyses of effect sizes as metrics for replication success and to compute overall reproducibility rates. Even though replication studies usually follow a strict protocol that adheres as closely as possible to the original study, it is unavoidable that slightly varying conditions in the two studies can lead to variability in the underlying true effects. Sources of such within-study-pair heterogeneity are manifold (see for example Table 1 in Bryan et al. (2021) [24] for behavioral intervention research) and include differences in the study population or in the definition of the outcome or intervention. For example, many replication studies sample participants from slightly different populations [11]. Of the previously mentioned large-scale RPs, only the Replication Project Cancer Biology (RPCB) discussed within-study-pair heterogeneity and accounted for it in sensitivity analyses by re-estimating the agreement in significance across study-pairs under the assumption that there was within-study-pair heterogeneity [5]. To estimate the amount of heterogeneity they had to use preregistered many-lab experiments in Psychology [25], because within-study-pair heterogeneity cannot be estimated in the standard “Many Phenomena, One Study” setup.
Many-lab experiments follow a “One Phenomenon, Many Studies” setup, attempting to directly replicate one original finding many times in different labs or groups [6]. A key goal of these many-lab experiments is to examine the heterogeneity of effect size within replication studies [26, 27].

Turning back to large-scale RPs, Pawel and Held (2020) [10] used a model which can take into account shrinkage and within-original-replication-study-pair heterogeneity to predict the replication effect estimate. They concluded that some degree of heterogeneity between original and replication effects should be expected. In a related setting, Röver and Friede (2024) [28] used methods from meta-analysis to investigate heterogeneity in pairs of so-called “study twins”, two similar confirmatory clinical phase III trials that are based on a common protocol. They concluded that a single study-pair “provides only very little evidence on the heterogeneity” within the pair.

Summarising the results across all original-replication study-pairs included in the same RP into an overall reproducibility assessment of a research field, as usually done in RPs, is challenged by the presence of between-study-pair heterogeneity. Meta-regression methodology could help to contextualize the results of heterogeneous projects, but has only rarely been used to examine the results of reproducibility efforts. The RPP team used meta-regressions to investigate evidence for publication bias in their data, by first testing whether the original, respectively replication, standard errors are associated with the original, respectively replication, effect sizes, and second, whether the original standard error is associated with the difference in effect sizes [1, Supplement Sect 4.g and 4.h]. Bench et al. (2017) [29] reanalysed the results from the RPP and used meta-regression to investigate whether the expertise of the replication team influenced the results of replication studies (i.e., replication effect size). In another reanalysis of the RPP data by van Bavel et al. (2016) [30], regression models were used to examine the association between reproducibility (measured by the binary rating that the original results were successfully replicated) and contextual sensitivity. Altmejd et al. (2019) [31] used linear regression to predict the relative effect size (replication effect size divided by original effect size) using parameters related to the design and properties of the author team of the original and replication studies, with data from the RPP, the RPEE and two many-lab experiments.

Explaining shrinkage and heterogeneity using location-scale meta-regression.

Location-scale meta-regressions should be well suited to help comprehend what the primary drivers of shrinkage (i.e., the location) and between-study-pair heterogeneity (i.e., the scale) are. More specifically, when planning future replication projects, it is valuable to know what levels of shrinkage and what amount of heterogeneity in effect size differences are to be expected, as this directly influences the reproducibility rate estimated in the project, which is often based on effect size differences. When analysing the results from large-scale RPs, researchers might want to perform reproducibility subgroup analyses by estimating the reproducibility separately for those original-replication study-pairs with specific characteristics linked to the location and/or the scale.

The additive version of the (location-scale) meta-regression is more broadly used, but the multiplicative version, with its assumption of a single overall effect size, might be particularly well suited in the replication setting, where replication studies are based on study protocols that are very similar to those of the original studies [, Sect 4]. Recently, multiplicative meta-regressions have been employed to explore how design differences are associated with the variation in results between study-pairs of randomised trials and their replications performed with real-world data [23]. The difference in effect sizes, which should be close to zero if there were no shrinkage, is used as the outcome variable, and the multiplicative between-study-pair heterogeneity can be readily extracted from this model.

In the present paper, we use and extend methodology from the meta-analysis literature to study shrinkage of effect size and heterogeneity in effect size differences between original-replication study-pairs in the research fields of Psychology and Experimental Economics. Location-scale meta-regression will help identify potential sources of shrinkage and heterogeneity, and allow for a more nuanced conclusion on the reproducibility of the research published in those fields. The proposed methodology is presented in the following section and illustrated in a case study. We close with a discussion of the results and the limitations.

Methods

Statistical assessment of large-scale replication projects

A replication project is composed of n independent original-replication study-pairs. The difference in effect size between an original study i and its replication – the outcome of interest – is defined as $d_i = \theta_{o_i} - \theta_{r_i}$, where $\theta_{o_i}$ and $\theta_{r_i}$ are the underlying effects, estimated by $\hat{\theta}_{o_i}$ and $\hat{\theta}_{r_i}$, in the original study and the replication study, respectively. The difference is estimated by $\hat{d}_i = \hat{\theta}_{o_i} - \hat{\theta}_{r_i}$. The effect size can be a mean difference, a log odds ratio, a correlation or similar, and might have to be transformed to follow approximately a normal distribution [32]. The effect type of all n study-pairs is assumed to be the same, by default or after transformation, and the original effects are all oriented to be positive. The standard errors are denoted by $\sigma_{o_i}$ and $\sigma_{r_i}$, respectively. Original and replication study are assumed to be two independent studies, each collecting its own data, and the standard error of the effect size difference is $\sigma_{d_i} = \sqrt{\sigma_{o_i}^2 + \sigma_{r_i}^2}$. An important quantity in the assessment of replication projects is the standardized difference $z_i$ [33], which measures the compatibility of effect sizes:

$$z_i = \frac{\hat{d}_i}{\sigma_{d_i}} = \frac{\hat{\theta}_{o_i} - \hat{\theta}_{r_i}}{\sqrt{\sigma_{o_i}^2 + \sigma_{r_i}^2}}. \qquad (1)$$

Under the null hypothesis H0 of homogeneity between studies, $z_i$ follows a standard normal distribution. The squared standardized difference $z_i^2$, which is often referred to as the Q-statistic, follows a $\chi^2_1$ distribution under H0. Cochran’s Q-test for heterogeneity between two studies uses this last property [34, 35]. In the replication setting, the Q-test measures the evidence that the observed differences between original and replication studies are due to more than just random variation. For an overall test of heterogeneity in a replication project, encompassing n independent study-pairs, the following test statistic and property can be used:

$$Q = \sum_{i=1}^{n} z_i^2 \sim \chi^2_n \quad \text{under } H_0. \qquad (2)$$
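As a small numerical sketch of Eqs (1) and (2), the standardized differences and the overall Q-test can be computed directly; the effect estimates and standard errors below are illustrative values, not data from the replication projects:

```python
import numpy as np
from scipy import stats

# Hypothetical effect estimates (Fisher-z scale) and standard errors
# for three original-replication study-pairs -- illustrative values only.
theta_o = np.array([0.55, 0.40, 0.62])   # original effect estimates
theta_r = np.array([0.20, 0.35, 0.10])   # replication effect estimates
se_o = np.array([0.12, 0.10, 0.15])      # original standard errors
se_r = np.array([0.10, 0.09, 0.12])      # replication standard errors

# Standardized differences (Eq 1)
se_d = np.sqrt(se_o**2 + se_r**2)
z = (theta_o - theta_r) / se_d

# Overall heterogeneity test across the n pairs (Eq 2):
# Q = sum of squared standardized differences, chi^2 with n df under H0
Q = float(np.sum(z**2))
p_overall = float(stats.chi2.sf(Q, df=len(z)))
```

A small `p_overall` indicates that the differences between original and replication estimates exceed what pure sampling variation would produce.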

As discussed above, between-study-pair heterogeneity can be quantified with a multiplicative or additive variance inflation parameter [12]. For the multiplicative heterogeneity parameter $\phi$, the differences $\hat{d}_i$ for all study-pairs i are assumed to be independently distributed, such that

$$\hat{d}_i \sim \mathcal{N}\left(d,\ \phi\,\sigma_{d_i}^2\right). \qquad (3)$$

A weighted linear regression model relating the estimated differences $\hat{d}_i$ to a constant is considered, and $\phi$ is estimated as the mean squared error of the model [17, 36]. The weights are equal to the inverse of the variance of the difference, $w_i = 1/\sigma_{d_i}^2$, ensuring that more precise study-pairs have more influence in the analysis. Further, $\phi > 1$ indicates heterogeneity, while in the absence of heterogeneity $\phi = 1$. In practice, if $\phi$ is estimated to be smaller than 1, it will be set to 1, as suggested by Mawdsley et al. (2016) [17]. To quantify heterogeneity with an additive variance inflation parameter $\tau^2$,

$$\hat{d}_i \sim \mathcal{N}\left(d,\ \sigma_{d_i}^2 + \tau^2\right), \qquad (4)$$

a random-effects model is considered, with weights $w_i^{*} = 1/(\sigma_{d_i}^2 + \hat{\tau}^2)$ assigned to each study-pair i. The heterogeneity variance $\tau^2$ is then estimated as the between-study-pair variance from this random-effects model, implemented in the metafor R-package [16]. In the absence of heterogeneity $\tau^2 = 0$, and $\tau^2 > 0$ otherwise. The results of both model versions for heterogeneity can be used to compute a prediction interval around the predicted difference in effect size for a new study-pair [20]. The choice between model versions (multiplicative vs additive) depends on the assumptions made about the underlying effect sizes. The multiplicative model assumes a single, common true effect size across study-pairs, whereas the additive model allows for variability in the true effect sizes across study-pairs.
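The two unadjusted models can be sketched in a few lines. `multiplicative_phi` follows the weighted-mean-squared-error estimator described above, while `additive_tau2` uses the moment-based DerSimonian-Laird estimator as a simple stand-in for metafor's default REML estimator (the two will generally give similar, but not identical, values). Both function names are ours:

```python
import numpy as np

def multiplicative_phi(d_hat, se_d):
    """Multiplicative heterogeneity (Eq 3): phi is the mean squared
    error of a weighted intercept-only regression, truncated at 1."""
    w = 1.0 / se_d**2
    d = np.sum(w * d_hat) / np.sum(w)              # weighted mean = intercept
    phi = np.sum(w * (d_hat - d)**2) / (len(d_hat) - 1)
    return d, max(phi, 1.0)                         # set to 1 if estimated < 1

def additive_tau2(d_hat, se_d):
    """Additive heterogeneity (Eq 4): moment-based (DerSimonian-Laird)
    estimate of tau^2, truncated at 0."""
    w = 1.0 / se_d**2
    d_fe = np.sum(w * d_hat) / np.sum(w)            # fixed-effect pooled difference
    Q = np.sum(w * (d_hat - d_fe)**2)               # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    return max(0.0, (Q - (len(d_hat) - 1)) / c)
```

With identical differences across pairs both estimators return their no-heterogeneity boundary values (1 and 0, respectively).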

Location-scale models in replication projects

To investigate sources of shrinkage (i.e., differences) and effect size difference variability between study-pairs (i.e., heterogeneity), meta-regression models for the location and scale will be used. The dependent variable in the regression is the difference in effect size $\hat{d}_i$ estimated for each study-pair, and the units of analysis are the study-pairs i. Given a set of p candidate covariates, $x_1, \ldots, x_p$, with values $x_{i1}, \ldots, x_{ip}$ for study-pair i, treating heterogeneity as a multiplicative parameter leads to the following model,

$$\hat{d}_i \sim \mathcal{N}\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},\ \phi\,\sigma_{d_i}^2\right), \qquad (5)$$

where $\phi$ denotes the residual multiplicative heterogeneity parameter, i.e., the between-study-pair heterogeneity that remains after correcting for the p covariates, $\beta_j$ is the model coefficient for covariate $x_j$, and $\beta_0$ is the model’s intercept. For the additive version of heterogeneity, the model is adapted to

$$\hat{d}_i \sim \mathcal{N}\left(\gamma_0 + \sum_{j=1}^{p} \gamma_j x_{ij},\ \sigma_{d_i}^2 + \tau^2\right), \qquad (6)$$

with $\tau^2$ being the residual additive heterogeneity variance and $\gamma_0$ and $\gamma_j$ being the intercept and coefficients of the additive model version. Cinar et al. (2021) [21] define the residual heterogeneity as the “variability in the true outcomes not accounted for by the moderators included in the model”. The covariates $x_j$ are referred to as location covariates and $\beta_j$, respectively $\gamma_j$, are the location coefficients, as they stem from an investigation of the covariates’ relationship with, or effect on, the location, i.e., the size of the outcome.

As an appropriate covariate to explain the amount of shrinkage of effect size, the RPP used the standard error of the original study, $\sigma_{o_i}$ [1, Supplement Sect 4.h]. They interpret a positive effect as “imprecise original studies (large standard error) yielding larger differences in effect size between original and replication study” and directly relate a positive effect to evidence for publication bias. Traditionally, the outcome in Egger’s regression test [37] is the effect size of each single study. Under the common assumption that the replication studies are rigorously planned, follow a strict study protocol, and are therefore unbiased, a positive effect of the original standard error on the difference between original and replication study effect sizes can still be related to bias in the original studies.

Location-scale models directly relate covariates to the amount of heterogeneity in the outcome [20]. The scale refers to the heterogeneity, which is now specific to the study-pair. For this, another set of q covariates, $z_1, \ldots, z_q$, with values $z_{i1}, \ldots, z_{iq}$ for the ith study-pair, is introduced and referred to as “scale covariates”. The residual multiplicative heterogeneity parameter $\phi_i$, respectively the residual additive heterogeneity variance $\tau_i^2$, for original-replication study-pair i are defined as

$$\phi_i = \exp\left(\alpha_0 + \sum_{k=1}^{q} \alpha_k z_{ik}\right), \qquad (7)$$
$$\tau_i^2 = \exp\left(\delta_0 + \sum_{k=1}^{q} \delta_k z_{ik}\right), \qquad (8)$$

where $\alpha_0$ and $\delta_0$ are the respective intercepts, and $\alpha_k$ and $\delta_k$ are the scale coefficients for the scale covariates $z_k$. The log link ensures that the resulting heterogeneity parameters are positive, while $\phi_i$ is additionally forced to be larger than or equal to 1. The models defined in Eqs 5 and 6 are special cases of the location-scale models in Eqs 7 and 8 with $\alpha_k = 0$, respectively $\delta_k = 0$, for all k. The location-scale model with additive heterogeneity is implemented in the R-package metafor [16]. To compute the location-scale model with multiplicative heterogeneity parameter, generalized additive models for location, scale and shape, as implemented in the R-package gamlss [38], are used with an offset to incorporate the weighting. Specifically, combining Eqs 3 and 7 gives $\operatorname{Var}(\hat{d}_i) = \phi_i\,\sigma_{d_i}^2 = \exp\left(\alpha_0 + \sum_{k=1}^{q} \alpha_k z_{ik} + \log \sigma_{d_i}^2\right)$, leading to an offset of $\log(\sigma_{d_i}^2)$ in the formula for the scale.
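The multiplicative location-scale model can also be fitted by direct maximum likelihood, which makes the role of the $\log(\sigma_{d_i}^2)$ offset explicit. The sketch below assumes a single location covariate x and a single scale covariate z, omits the truncation of $\phi_i$ at 1, and is an illustration of the model structure rather than a substitute for the gamlss implementation; the function name is ours:

```python
import numpy as np
from scipy.optimize import minimize

def fit_location_scale_mult(d_hat, se_d, x, z):
    """ML sketch of the multiplicative location-scale model:
    mean_i = beta0 + beta1 * x_i                    (location, Eq 5)
    Var_i  = exp(alpha0 + alpha1 * z_i) * se_d_i^2  (scale, Eq 7)
    The log(se_d_i^2) term acts as the offset in the scale formula."""
    def negloglik(par):
        b0, b1, a0, a1 = par
        mu = b0 + b1 * x
        var = np.exp(a0 + a1 * z + np.log(se_d**2))  # offset log(se_d^2)
        return 0.5 * np.sum(np.log(2 * np.pi * var) + (d_hat - mu)**2 / var)
    fit = minimize(negloglik, x0=np.zeros(4), method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
    return fit.x  # (beta0, beta1, alpha0, alpha1)
```

Note that this uses plain ML rather than the REML estimation recommended below for model selection.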

Model selection.

As shown in Cinar et al. (2021) [21], information-theoretic approaches can be used for both location-only and location-scale models. More specifically, to select the most important location covariates, all candidate meta-regression models for the location are considered and their Akaike information criterion (AIC) is computed [39]. The AIC is based on the log-likelihood l and is defined as $\mathrm{AIC} = -2l + 2k$, where k is the total number of model parameters: the number (p) of location covariates plus two (the intercept and the heterogeneity variance $\tau^2$ or heterogeneity parameter $\phi$). Note that the likelihood can be estimated either with maximum likelihood (ML) or restricted maximum likelihood (REML). As shown through simulation studies presented in Cinar et al. (2021) [21] and Viechtbauer and López-López [20], REML estimation outperformed ML estimation when selecting among candidate models. Henceforth, we will use REML estimation. The final meta-regression model for the location is the one minimising the AIC, including the best location covariates. To select the best performing scale covariates among the q candidates, the same selection procedure based on the AIC can be used, with the exception that k in the definition of the AIC is now equal to the number (p) of location covariates plus the number (q) of scale covariates plus two (the location intercept and the scale intercept). The same procedure is followed for models with multiplicative and additive heterogeneity.
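The exhaustive AIC search over covariate subsets can be sketched generically. Here `fit_fn` is a hypothetical fitting routine standing in for, e.g., a REML meta-regression fit; it must return the model's log-likelihood for the supplied covariate columns:

```python
import itertools
import numpy as np

def aic_best_subset(d_hat, se_d, X, fit_fn):
    """Exhaustive AIC search over subsets of the p candidate location
    covariates (columns of X). `fit_fn(d_hat, se_d, X_sub)` is a
    hypothetical fitting routine returning the model log-likelihood;
    k counts the included covariates plus intercept and phi (or tau^2)."""
    best_aic, best_subset = np.inf, None
    for r in range(X.shape[1] + 1):
        for subset in itertools.combinations(range(X.shape[1]), r):
            loglik = fit_fn(d_hat, se_d, X[:, list(subset)])
            k = len(subset) + 2                 # AIC = -2*l + 2*k
            aic = -2 * loglik + 2 * k
            if aic < best_aic:
                best_aic, best_subset = aic, subset
    return best_aic, best_subset
```

The same loop serves for scale covariates by redefining k as p + q + 2, as described above.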

Case study: Sources of shrinkage and heterogeneity in psychology and experimental economics

To illustrate the applicability of the proposed methodology, we reanalyse data from the replication projects in Psychology [RPP, 1] and in Experimental Economics [RPEE, 2] provided by Altmejd et al. (2019) [31]. Note that this secondary data analysis is of exploratory rather than confirmatory nature, without any preregistration. The conclusions drawn from our analysis might help formulate new hypotheses to be tested in a subsequent confirmatory study with data collected for this purpose.

Data source and description

Using machine learning models, Altmejd et al. (2019) [31] aimed at predicting reproducibility in two large-scale replication projects (RPP and RPEE) and two many-lab experiments. As outcome measure they used a binary criterion for successful replication, defined as a replication with a significant effect (two-sided p-value < 0.05) in the same direction as the original study. Additionally, they attempted to predict the relative effect size, i.e., the ratio of replication and original effect sizes, which were standardized to correlation coefficients. The covariates or features used in the machine learning models were divided into two classes: features related to the statistical design properties and outcomes, and features related to the descriptive aspects of the original and replication study, including the citation count of the original articles or the past success of the authors. For our case study, we employ only the RPP and the RPEE, since we are interested in between-study-pair heterogeneity, which cannot be investigated in many-lab experiments. The data was downloaded from the Open Science Framework (OSF, https://osf.io/4fn73/). As explained in the data analysis protocol in our Appendix, the data provided by Altmejd et al. (2019) [31] was merged with the data on the same replication projects from the R-package ReplicationSuccess [40]. For our analysis we need standard errors of the effect sizes and are therefore restricted to the so-called meta-analytic subset, for which both the z-transformed correlation coefficient and its standard error could be computed (73 of the 100 RPP study-pairs and 18 of the 18 RPEE study-pairs). Further, Altmejd et al. (2019) [31] included only original studies with an effect interpreted as significant by the original authors (three of the original studies from the RPP were non-significant and therefore excluded). One more study from the RPP was excluded due to too many missing values in the covariates. In total, 87 original-replication study-pairs are included in our case study; 69 from the RPP and 18 from the RPEE.

While the machine learning models used all the covariates without any transformation or selection, we base our initial selection on subject-matter knowledge to avoid overfitting given the relatively small sample size, and apply log-transformations to count variables. Additionally, we use only information from the original study and information from the replication study that was defined prior to the replication being conducted, which ensures our models remain useful for prediction. In total, nine continuous and five categorical covariates were selected as candidates. We refer to our Appendix for a description of the available data and covariates and, specifically, to Table A.3 and Table A.4 for a summary of the selected candidate covariates and applied transformations.

Fig 1 shows the difference in effect size for all study-pairs in both replication projects with their 95% confidence intervals. Most replication attempts show high levels of shrinkage of the effect estimate [9]: compared to the original effect size, a smaller effect size is observed in the corresponding replication, i.e., $\hat{d}_i > 0$. Among all original-replication study-pairs only 14.9% have a negative difference.

Fig 1. Ordered differences between effect estimates on Fisher-z scale for all included study-pairs (n = 87) in both replication projects with their 95% confidence interval.

The dashed horizontal line indicates no difference in effect size.

https://doi.org/10.1371/journal.pone.0327799.g001

Evidence for shrinkage and between-study-pair heterogeneity

The observed standardized differences $z_i$ in Eq (1) are computed for all study-pairs. They are shown in the left panel of Fig 2 against a standard normal distribution. In the absence of shrinkage and heterogeneity, the distribution of the $z_i$ and the standard normal distribution would be aligned. However, the distribution of the standardized differences is shifted towards positive values, indicating shrinkage. The right panel of Fig 2 shows the p-values resulting from a Q-test for heterogeneity within each individual original-replication study-pair. In the absence of shrinkage and heterogeneity, these p-values would be uniformly distributed, but too many small p-values are observed. An overall test for between-study-pair heterogeneity as described in Eq (2) suggests that heterogeneity cannot be disregarded (overall p-value < 0.0001).

Fig 2. Left: The distribution of the observed standardized difference of the original-replication-study-pair (n = 87) compared to the standard normal distribution.

Right: The p-values from the Q-test for heterogeneity within original-replication study-pairs, as well as the p-value for the overall test for heterogeneity between all study-pairs, included in both replication projects.

https://doi.org/10.1371/journal.pone.0327799.g002

The first column of Table 1 shows the multiplicative and additive heterogeneity, extracted from the unadjusted models in Eqs (3) and (4) respectively. The estimate of the model intercept can be interpreted as the estimated overall difference in the original-replication comparison. This overall difference is positive, suggesting that, on average, the original effect estimates are larger than the effect estimates from the replications, i.e., evidence for shrinkage of effect size. The estimates of the unadjusted heterogeneity presented in Table 1 are the baseline estimates of heterogeneity to be reduced and explained with location-scale meta-regressions. More specifically, using the additive model, the difference in effect size is estimated to be 0.21 with a 95% confidence interval from 0.16 to 0.26. The between-study-pair variance is estimated to be $\hat{\tau}^2 = 0.021$, which leads to a 95% prediction interval from −0.08 to 0.5 for the difference in effect size of a new study-pair, revealing substantial heterogeneity in the difference between study-pairs.
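As a quick arithmetic check, the reported prediction interval can be reproduced from the quoted estimates; back-computing the standard error of the pooled difference from the reported 95% confidence interval is an assumption on our part:

```python
import math

d, tau2 = 0.21, 0.021                     # pooled difference and tau^2 (Table 1)
se_d = (0.26 - 0.16) / (2 * 1.96)         # SE back-computed from the 95% CI
half_width = 1.96 * math.sqrt(tau2 + se_d**2)
pi_low, pi_high = d - half_width, d + half_width
print(round(pi_low, 2), round(pi_high, 2))  # prints: -0.08 0.5
```

The heterogeneity term $\tau^2$ dominates the interval width, which is why the prediction interval is so much wider than the confidence interval for the pooled difference.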

Table 1. Summary of investigated location meta-regression models with multiplicative and additive heterogeneity. The location coefficient estimates are shown with their 95% confidence intervals (95% CI) and the residual heterogeneity. The first model is the weighted unadjusted meta-regression model relating the difference in effect size to a constant, as in Eqs 3 and 4. For the second model, one covariate was added into the meta-regression as a proof of concept. The third model represents the final model selected via the model selection procedure. Additional information on the covariates is given in the note below the table.

https://doi.org/10.1371/journal.pone.0327799.t001

Location-only meta-regression models

We follow the analysis presented in the supplement of the RPP [1] and add the standard error of the original study, $\sigma_{o_i}$, as the first location covariate of interest to be included in the meta-regression. We would expect to see more shrinkage for less precise original results, i.e., larger original standard errors. The second column in Table 1 summarises the results of the multiplicative and additive meta-regressions with the resulting residual heterogeneity parameter and residual heterogeneity variance, respectively. The results show a positive association between the original standard error and the difference in effect size. As hypothesized, a larger $\sigma_{o_i}$ (a less precise original finding) leads to larger differences, and hence more shrinkage in effect size (the coefficient of $\sigma_{o_i}$ is estimated to be 1.6 in the multiplicative model and 1.11 in the additive model). The heterogeneity parameter is reduced from 2.004 to 1.682 for the multiplicative model and the heterogeneity variance from 0.021 to 0.017 for the additive model. The association between the original standard error and the effect size difference is also shown in Fig B.3 in our Appendix, with confidence and prediction intervals depending on whether an additive or a multiplicative version of heterogeneity is used.

More covariates are available to further reduce the residual heterogeneity (see Table A.3 and Table A.4 in our Appendix). A total of 4’608 models can be fitted, depending on which of the covariates are included for the location in addition to the original standard error. The eight continuous covariates are the original paper length, defined as the number of pages from the citation information (according to personal communication with A. Altmejd), the log-transformed citation count of the original study, the number of authors in the original and in the replication team, the share of male authors in the original and in the replication team, and the log-transformed average number of citations per author in the original and in the replication team. The five categorical covariates are the discipline, the highest seniority of the authors in the original and replication team, and binary covariates indicating whether the original and replication experiments were conducted in the same language or country, and whether they used the same type of subjects. For the sake of interpretation, the continuous covariates informing on the share of male authors in the original or the replication study will only be included in combination with the covariates informing on the total number of authors in the original or the replication study, respectively. For both model types, the AIC of the best-performing models (minimum AIC) per number of covariates is represented in Fig 3, together with the respective residual heterogeneity. The minimum AIC is reached after seven more covariates are added in the multiplicative model and two more covariates are added in the additive model. The last column in Table 1 informs on which covariates are included in the respective models; see also the description of the included covariates in the notes. Additionally, the residual heterogeneity is shown.
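
The number of candidate models follows from enumerating all covariate subsets that respect the pairing constraint on the share-of-male-authors covariates. A short sketch (covariate names are generic placeholders; the actual model-fitting and AIC computation are omitted) confirms the count:

```python
from itertools import combinations

# 9 freely includable covariates plus two (author count, share male)
# pairs, for the original and the replication team, where the share may
# only enter together with the corresponding author count.
free = [f"cov{i}" for i in range(1, 10)]
pairs = [("n authors (O)", "share male (O)"), ("n authors (R)", "share male (R)")]

def valid_location_models():
    all_covs = free + [c for pair in pairs for c in pair]
    for k in range(len(all_covs) + 1):
        for subset in combinations(all_covs, k):
            s = set(subset)
            # keep only subsets respecting the pairing constraint
            if all(count in s or share not in s for count, share in pairs):
                yield s

n_models = sum(1 for _ in valid_location_models())
print(n_models)  # 4608 = 2^9 * 3^2 candidate location models
```

Each constrained pair has three admissible states (neither covariate, the count alone, or count plus share), which gives 2⁹ · 3² = 4’608 models.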
The set of included covariates leads to a substantial decrease in heterogeneity: from 2.004 to 1.212 for the multiplicative heterogeneity parameter and from 0.021 to 0.012 for the additive heterogeneity variance. Table 1 shows the coefficient estimates and their 95% confidence intervals for the final meta-regression models with only location coefficients. Of interest is that, regardless of the model version, the magnitudes of the coefficient estimates are similar for those covariates included in both models. A larger number of authors on the original paper is associated with more shrinkage (the coefficient is 0.03 in the multiplicative model vs 0.04 in the additive model), while a larger share of male authors in the original author list decreases the effect size difference (the coefficient is –0.19 in the multiplicative model vs –0.17 in the additive model). The same model selection steps using the Bayesian information criterion can be found in our Appendix. The models with smallest BIC are more parsimonious than those with smallest AIC. The number of original authors and the share of male authors on the original papers are judged most important in both model versions, while the multiplicative model selected with BIC also includes the binary covariate informing on the original and replication experiments being conducted in the same country. When selecting with AIC, the covariate “O&R same language”, probably highly correlated with “O&R same country” and conveying the same information, was judged more important.

Fig 3. The AIC for the multiplicative and the additive models with best performance (min AIC) for each possible number of covariates included.

The residual multiplicative heterogeneity parameter and additive heterogeneity variance of the respective models are also shown. At least one covariate, namely the original standard error, is included. The minimum AIC value is highlighted.

https://doi.org/10.1371/journal.pone.0327799.g003

Location-scale meta-regression models

For computational simplicity, only those covariates retained to explain the location in the multiplicative and additive models, respectively, are now also employed as candidates to explain the scale. Tables 2 and 3 show which sets of location and scale covariates are included in the best ten models according to their AIC, for the multiplicative and additive heterogeneity respectively. For the multiplicative model versions, the smallest AIC is obtained by dropping one location covariate (number of pages) and adding the scale covariates “O&R same language” and “Citations (log, O)”. For the additive version, all location covariates are kept and “Original standard error” is added as scale covariate. The final models are summarised in Table 4. The location coefficient estimates are of similar magnitude as the ones presented in Table 1. Turning to the scale covariates, the multiplicative heterogeneity is reduced when the experiments in the original and the replication studies were conducted in the same language, and when the original study was cited more. The additive heterogeneity is reduced for less precise original studies. The residual heterogeneity resulting from the two location-scale meta-regression models is not constant, but instead a function of the scale coefficients. For study-pairs with an average original citation count (on the log scale) of 4.06, the residual multiplicative heterogeneity is estimated to be 1 for study-pairs using the same language in the original and replication experiment, and 1.29 for study-pairs with experiments conducted in different languages. Similarly, the residual additive heterogeneity is estimated to lie between approximately 0, for the maximum original standard error of 0.58, and close to 0.05, for the minimum original standard error of 0.04. Graphical model diagnostics of the selected, final location-scale meta-regressions, i.e., normal QQ plots, can be found in Fig C.4 in our Appendix, suggesting that the normality assumption of the data holds.
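The dependence of the residual multiplicative heterogeneity on the scale covariates can be illustrated with a log-linear scale model. The coefficients below are hypothetical, chosen only so that the sketch reproduces the two φ values reported in the text at the mean log citation count of 4.06; they are not the fitted coefficients from Table 4:

```python
import math

def residual_phi(same_language, log_citations,
                 g0=0.6606, g_lang=-0.2546, g_cit=-0.1):
    """Residual multiplicative heterogeneity under a log-linear scale
    model: phi_i = exp(g0 + g_lang * I(same language) + g_cit * log_cit).
    Coefficients are hypothetical, for illustration only."""
    return math.exp(g0 + g_lang * same_language + g_cit * log_citations)

print(round(residual_phi(1, 4.06), 2))  # ~1.0  (same language)
print(round(residual_phi(0, 4.06), 2))  # ~1.29 (different languages)
```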

Table 2. The ten best multiplicative location-scale models according to their AIC and depending on the selection of location and scale covariates. The location and scale covariates are marked with a check-mark if they are present in the model and with a dash if they are not. The last row of the Table shows the AIC of each model.

https://doi.org/10.1371/journal.pone.0327799.t002

Table 3. The ten best additive location-scale models according to their AIC and depending on the selection of location and scale covariates. The location and scale covariates are marked with a check-mark if they are present in the model and with a dash if they are not. The last row of the Table shows the AIC of the specific model.

https://doi.org/10.1371/journal.pone.0327799.t003

Table 4. The final multiplicative and additive location-scale meta-regression models minimising the AIC. The location and scale coefficient estimates are shown together with their 95% confidence intervals (95%CI). Only a subset of the covariates used for the location are kept to explain the scale. Additional information on the covariates is added in the note below the table.

https://doi.org/10.1371/journal.pone.0327799.t004

Discussion

In this paper, we investigated shrinkage of effect size and heterogeneity in the difference in effect sizes between original-replication study-pairs that are part of the same large-scale replication effort. We focused on quantifying heterogeneity and explored the potential sources of shrinkage and heterogeneity through covariates associated with study design and demographic factors of the studies and study teams. We suggest modelling the differences in effect size between original and replication study with location-scale meta-regressions, since they allow for a simultaneous investigation of shrinkage and heterogeneity. Traditionally, the units of analysis in location-scale meta-regressions are individual studies. We extended this methodology to large-scale RPs by modeling the differences in effect sizes between original and replication study and hence changing the units of analysis to study-pairs. Location covariates link to the amount of shrinkage of effect size, while the scale covariates help explain heterogeneity in the effect size differences. Commonly used model selection criteria can directly be applied to these models to reduce their complexity and select the best covariates for both the location and the scale [21]. Most literature and methods in meta-analysis and meta-regression to date focus on quantifying heterogeneity as an additive parameter [32], while we also illustrated how a related model utilizing a multiplicative version of heterogeneity can be computed [17]. Indeed, we believe that multiplicative heterogeneity has a lot of potential in the setting of RPs due to its easy interpretation and its assumption of a single overall effect size, which is particularly well suited for direct replication studies based on study protocols that are very similar to those of the original studies.

The results from the location-scale meta-regression models presented in our case study could be of interest to authors of future replication efforts. For example, study-pairs where the experiments of the original and replication study were conducted in the same language show lower multiplicative heterogeneity but more shrinkage. A larger number of authors on the original paper or a longer original paper are associated with more shrinkage in both model versions. Conversely, more precision in the original study (smaller original standard errors) and a larger share of male authors in the original author list are associated with less shrinkage. A larger original standard error, in turn, decreases the additive between-study-pair heterogeneity. Such findings could inform the planning of future replication projects, as authors would know how much shrinkage or heterogeneity to expect in effect size differences. In a field where, for some reason, a lot of shrinkage is to be expected, or where effect sizes generally tend to differ in magnitude from one experiment to the next, traditionally used replication success metrics, which are often based on effect size comparisons, will almost certainly conclude low reproducibility. In these cases, findings from location-scale meta-regression models could at least help contextualise the low reproducibility rates, specifically when discussing policies and interventions to improve the reproducibility of research. The findings could further influence the selection of replication success metrics, as some metrics, for example, penalise different levels of shrinkage [9].

However, the case study we present is purely observational, and any associations found could be confounded by other covariates. In order to confidently base replication project design decisions on such results, large-scale replication projects would have to start routinely collecting this information and potentially implement location-scale meta-regression analyses. Additionally, covariates related to questionable research practices or biases in the original study have the potential to be highly informative as sources of shrinkage and heterogeneity. A follow-up study could attempt to specifically collect such information and test the hypothesis that questionable research practices and other biases induce shrinkage and heterogeneity. This is, however, beyond the scope of our project and would involve experts assessing the risk of bias of the original studies [41]. The experts’ risk of bias assessments could then be used to determine whether any effect size differences and heterogeneity remain after bias adjustment [42]. Combining our methodology with a “limit meta-analysis” [43] may constitute an alternative approach to adjust for bias. Note that modeling the heterogeneity using a location-scale model is closely related to the framework presented in Holzmeister et al. (2024) [44], where different types of heterogeneity due to differences in study design, population or analysis are isolated and quantified.

Another field of application for the presented methodology is the analysis of heterogeneity across study-pairs of randomised controlled trials (RCTs) and their non-randomised emulations [23]. A meta-regression with only location covariates related to emulation differences was capable of reducing the heterogeneity substantially. Since shrinkage of effect size is less present in emulations of RCTs, location-scale meta-regressions with a special focus on the scale might be worth investigating. There are only limited data from large-scale initiatives comparing RCTs and database studies, none of which have systematically collected information to be included in location-scale meta-regressions. The data used in Heyard et al. (2024) [23] come from the RCT-DUPLICATE initiative (Randomized, Controlled Trials Duplicated Using Prospective Longitudinal Insurance Claims: Applying Techniques of Epidemiology), which collected some information on emulation differences, but only in a post-hoc manner to be used in a descriptive exploration [45]. More efforts emulating RCTs in real-world data are becoming available [46, 47], offering more use cases for our proposed methodology.

Limitations

Our study is not without limitations. When selecting the models with the best predictive performance in our case study, we chose to employ the Akaike information criterion, which could have influenced the set of covariates in the final models. Other criteria could be used instead, including the Bayesian information criterion (BIC) or scoring rules combined with leave-one-out cross-validation, as was done in [23]. We repeated the model selection steps using the BIC (see Appendix), which resulted in more parsimonious models. Since model selection was not the major focus of this project, we refrained from exploring this further. Further, the case study used very limited information on the study-pairs included in two major replication projects. The covariates were collected for their simplicity and convenience, and not for their informative value. As mentioned above, better explanatory covariates need to be collected in future RPs. We also suspect some of the covariates to be correlated, as for example the binary indicator informing on the original and replication experiment being conducted in the same language and the binary indicator informing on them being conducted in the same country, which might have influenced our results. Other transformations or combinations of the available covariates could also be used. For example, Pawel and Held (2020) [10] showed how the z-statistic of the original study is associated with shrinkage of effect size. The small number of included study-pairs is another limitation of our case study. Hence, the results should not be over-interpreted but should rather spark follow-up studies that use the proposed methodology with data collected for purpose.

Supporting information

S1 File. This is our appendix.

It includes descriptive statistics of the data as well as supplementary tables, figures and analyses.

https://doi.org/10.1371/journal.pone.0327799.s001

(PDF)

References

  1. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. pmid:26315443
  2. Camerer CF, Dreber A, Forsell E, Ho T-H, Huber J, Johannesson M, et al. Evaluating replicability of laboratory experiments in economics. Science. 2016;351(6280):1433–6. pmid:26940865
  3. Camerer CF, Dreber A, Holzmeister F, Ho T-H, Huber J, Johannesson M, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav. 2018;2(9):637–44. pmid:31346273
  4. Cova F, Strickland B, Abatista A, Allard A, Andow J, Attie M, et al. Estimating the reproducibility of experimental philosophy. Rev Phil Psych. 2018;12(1):9–44.
  5. Errington TM, Mathur M, Soderberg CK, Denis A, Perfito N, Iorns E, et al. Investigating the replicability of preclinical cancer biology. Elife. 2021;10:e71601. pmid:34874005
  6. McShane BB, Tackett JL, Böckenholt U, Gelman A. Large-scale replication projects in contemporary psychological research. Am Statist. 2019;73(sup1):99–105.
  7. National Academies of Sciences, Engineering, and Medicine; Policy and Global Affairs; Committee on Science, Engineering, Medicine, and Public Policy; Board on Research Data and Information; Division on Engineering and Physical Sciences; Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Analytics; Division on Earth and Life Studies; Nuclear and Radiation Studies Board; Division of Behavioral and Social Sciences and Education; Committee on National Statistics; Board on Behavioral, Cognitive, and Sensory Sciences; Committee on Reproducibility and Replicability in Science. Reproducibility and replicability in science. Washington, DC: National Academies Press; 2019. https://doi.org/10.17226/25303
  8. Copas JB. Using regression models for prediction: shrinkage and regression to the mean. Stat Methods Med Res. 1997;6(2):167–83. pmid:9261914
  9. Held L, Micheloud C, Pawel S. The assessment of replication success based on relative effect size. Ann Appl Stat. 2022;16(2).
  10. Pawel S, Held L. Probabilistic forecasting of replication studies. PLoS One. 2020;15(4):e0231416. pmid:32320420
  11. Gilbert DT, King G, Pettigrew S, Wilson TD. Comment on “Estimating the reproducibility of psychological science”. Science. 2016;351(6277):1037. pmid:26941311
  12. Thompson SG, Sharp SJ. Explaining heterogeneity in meta-analysis: a comparison of methods. Statist Med. 1999;18(20):2693–708.
  13. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21(11):1539–58. pmid:12111919
  14. Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. J R Stat Soc Ser A Stat Soc. 2009;172(1):137–59. pmid:19381330
  15. van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Stat Med. 2002;21(4):589–624. pmid:11836738
  16. Viechtbauer W. Conducting meta-analyses in R with the metafor package. J Stat Soft. 2010;36(3).
  17. Mawdsley D, Higgins JPT, Sutton AJ, Abrams KR. Accounting for heterogeneity in meta-analysis using a multiplicative model: an empirical study. Res Synth Methods. 2017;8(1):43–52. pmid:27259973
  18. van Aert RCM, Jackson D. A new justification of the Hartung-Knapp method for random-effects meta-analysis based on weighted least squares regression. Res Synth Methods. 2019;10(4):515–27. pmid:31111673
  19. Schmid CH, Stijnen T, White IR, editors. Handbook of meta-analysis. Chapman and Hall/CRC; 2020. https://doi.org/10.1201/9781315119403
  20. Viechtbauer W, López-López JA. Location-scale models for meta-analysis. Res Synth Methods. 2022;13(6):697–715. pmid:35439841
  21. Cinar O, Umbanhowar J, Hoeksema JD, Viechtbauer W. Using information-theoretic approaches for model selection in meta-analysis. Res Synth Methods. 2021;12(4):537–56. pmid:33932323
  22. Freuli F, Held L, Heyard R. Replication success under questionable research practices—a simulation study. Statist Sci. 2023;38(4).
  23. Heyard R, Pawel S, Frese J, Voelkl B, Würbel H, McCann S, et al. A scoping review on metrics to quantify reproducibility: a multitude of questions leads to a multitude of metrics. Royal Society Open Science. 2025;12(7).
  24. Bryan CJ, Tipton E, Yeager DS. Behavioural science is unlikely to change the world without a heterogeneity revolution. Nat Hum Behav. 2021;5(8):980–9. pmid:34294901
  25. Olsson-Collentine A, Wicherts JM, van Assen MALM. Heterogeneity in direct replications in psychology and its association with effect size. Psychol Bull. 2020;146(10):922–40. pmid:32700942
  26. Klein RA, Ratliff KA, Vianello M, Adams RB Jr, Bahník Š, Bernstein MJ, et al. Investigating variation in replicability. Soc Psychol. 2014;45(3):142–52.
  27. Klein RA, Vianello M, Hasselman F, Adams BG, Adams RB, Alper S, et al. Many labs 2: investigating variation in replicability across samples and settings. Adv Methods Pract Psychol Sci. 2018;10(4):443–90.
  28. Röver C, Friede T. Investigating the heterogeneity of “Study Twins”. Biom J. 2024;66(6):e202300387. pmid:39223907
  29. Bench SW, Rivera GN, Schlegel RJ, Hicks JA, Lench HC. Does expertise matter in replication? An examination of the reproducibility project: psychology. J Exp Soc Psychol. 2017;68:181–4.
  30. Van Bavel JJ, Mende-Siedlecki P, Brady WJ, Reinero DA. Contextual sensitivity in scientific reproducibility. Proc Natl Acad Sci U S A. 2016;113(23):6454–9. pmid:27217556
  31. Altmejd A, Dreber A, Forsell E, Huber J, Imai T, Johannesson M, et al. Predicting the replicability of social science lab experiments. PLoS One. 2019;14(12):e0225826. pmid:31805105
  32. Cooper H, Hedges LV, Valentine JC. The handbook of research synthesis and meta-analysis. Russell Sage Foundation; 2019. https://doi.org/10.7758/9781610441384
  33. Patil P, Peng RD, Leek JT. What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspect Psychol Sci. 2016;11(4):539–44. pmid:27474140
  34. Cochran WG. Problems arising in the analysis of a series of similar experiments. J Roy Statist Soc Ser B: Statist Methodol. 1937;4(1):102–18.
  35. Cochran WG. The combination of estimates from different experiments. Biometrics. 1954;10(1):101.
  36. Stanley TD, Doucouliagos H. Neither fixed nor random: weighted least squares meta-analysis. Stat Med. 2015;34(13):2116–27. pmid:25809462
  37. Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629–34. pmid:9310563
  38. Stasinopoulos MD, Rigby RA, Bastiani FD. GAMLSS: a distributional regression approach. Statist Model. 2018;18(3–4):248–73.
  39. Akaike H. A new look at the statistical model identification. IEEE Trans Automat Contr. 1974;19(6):716–23.
  40. Held L, Micheloud C, Pawel S, Gerber F, Hofmann F. Design and analysis of replication studies with replication success. 2022. https://crsuzh.github.io/ReplicationSuccess/index.html
  41. Higgins JP, Savović J, Page MJ, Elbers RG, Sterne JA. Assessing risk of bias in a randomized trial. Cochrane handbook for systematic reviews of interventions. Wiley; 2019. p. 205–28. https://doi.org/10.1002/9781119536604.ch8
  42. Turner RM, Spiegelhalter DJ, Smith GCS, Thompson SG. Bias modelling in evidence synthesis. J R Stat Soc Ser A Stat Soc. 2009;172(1):21–47. pmid:19381328
  43. Rücker G, Schwarzer G, Carpenter JR, Binder H, Schumacher M. Treatment-effect estimates adjusted for small-study effects via a limit meta-analysis. Biostatistics. 2011;12(1):122–42. pmid:20656692
  44. Holzmeister F, Johannesson M, Böhm R, Dreber A, Huber J, Kirchler M. Heterogeneity in effect size estimates. Proc Natl Acad Sci U S A. 2024;121(32):e2403490121. pmid:39078672
  45. Wang SV, Schneeweiss S, RCT-DUPLICATE Initiative, Franklin JM, Desai RJ, Feldman W, et al. Emulation of randomized clinical trials with nonrandomized database analyses: results of 32 clinical trials. JAMA. 2023;329(16):1376–85. pmid:37097356
  46. Antoine A, Pérol D, Robain M, Bachelot T, Choquet R, Jacot W, et al. Assessing the real-world effectiveness of 8 major metastatic breast cancer drugs using target trial emulation. Eur J Cancer. 2024;213:115072. pmid:39476445
  47. Signori A, Ponzano M, Kalincik T, Ozakbas S, Horakova D, Kubala Havrdova E, et al. Emulating randomised clinical trials in relapsing-remitting multiple sclerosis with non-randomised real-world evidence: an application using data from the MSBase registry. J Neurol Neurosurg Psychiatry. 2024;95(7):620–5. pmid:38242680