
Publication bias in psychology: A closer look at the correlation between sample size and effect size

Abstract

Previously observed negative correlations between sample size and effect size (n-ES correlation) in psychological research have been interpreted as evidence for publication bias and related undesirable biases. Here, we present two studies aimed at better understanding to what extent negative n-ES correlations reflect such biases or might be explained by unproblematic adjustments of sample size to expected effect sizes. In Study 1, we analysed n-ES correlations in 150 meta-analyses from cognitive, organizational, and social psychology and in 57 multiple replications, which are free from relevant biases. In Study 2, we used a random sample of 160 psychology papers to compare the n-ES correlation for effects that are central to these papers and effects selected at random from these papers. n-ES correlations proved inconspicuous in meta-analyses. In line with previous research, they do not suggest that publication bias and related biases have a strong impact on meta-analyses in psychology. A much higher n-ES correlation emerged for publications’ focal effects. To what extent this should be attributed to publication bias and related biases remains unclear.

Introduction

A spectre is haunting psychology–the spectre of bias. The erroneous belief that statistically non-significant findings are uninformative incentivises researchers to publish statistically significant findings [1,2]. As a consequence, researchers might selectively report those analyses and outcomes that turn out statistically significant, and they might keep their statistically non-significant studies in the file drawer [3,4]. These biases, collectively known as publication selection bias (PSB), cause a problematic inflation of favourable evidence in the published literature [5]. As a consequence, treatments might be less effective than believed and PhD students and researchers might waste their time investigating imaginary effects.

PSB is a concern across many disciplines [6,7]. Multiple lines of evidence indicate that psychology is affected too. Thus, effect sizes are often substantially smaller: in unpublished than in published studies [8,9]; in replications than in the original studies being replicated [10]; and in studies with some pre-registration of hypotheses, data collection methods, and analyses (all of which limit PSB) than in studies without pre-registration [11]. Also, registered reports (which preclude PSB because details of data collection, reporting standards, and publication are agreed before the study commences) find evidence in favour of their central hypothesis much less frequently than conventional studies do [12]. Moreover, PSB is often suggested by various techniques developed to detect its prevalence in meta-analyses [8,13].

In light of this evidence for PSB and its negative consequences in psychology, an indicator would be desirable to reflect how serious the problem is and, perhaps more importantly, whether PSB reduces over time, e.g., due to the effectiveness of proposed counter-measures such as study pre-registration [14]. As we shall describe in greater detail below, the indicators that we have discussed so far (e.g., comparison of results across studies that are more or less prone to PSB) show serious limitations for these purposes. More suitable towards these ends might be the correlation between studies’ sample size n and their effect size (henceforth n-ES correlation), as we argue in detail below. Here, we present two studies that aim to better understand the validity of the n-ES correlation as an indicator of PSB in psychology.

Measures that proved valuable for flagging PSB as a potential problem might be less suitable to indicate how widespread the problem is or how PSB changes (or fails to change) over time. Effect size comparisons between published and unpublished studies are hampered by the fact that the latter are difficult to obtain without bias [8]. Effect size comparisons between original studies and their replications are limited by the relatively small number of replications and the lack of representativeness in the studies chosen for replication (e.g., a short online study that shows astonishing results will be more likely to be replicated than an arduous clinical trial that finds a small effect). Given the relative novelty of pre-registered studies and registered reports, analyses reliant on them have limited value for studying change over time. Finally, techniques that have been developed to uncover PSB within meta-analyses often disagree in their conclusions, suffer from low statistical power, and generally struggle in the face of effect size heterogeneity (i.e., when the true magnitude of the effect under investigation varies across studies), which is almost ubiquitous [15–17].

The n-ES correlation is an alternative indicator of PSB, which avoids these problems. Its logic is best illustrated when we imagine a set of studies that investigate the same effect in the same way but differ in their ns. Let us assume that all studies with p < .05 get published and all studies with p ≥ .05 get rejected. Across studies, the observed effect sizes fluctuate symmetrically around the true population effect size (with this fluctuation being stronger in smaller studies than in larger studies, see funnel plot in Fig 1). Whether a study’s p-value turns out low enough to result in publication hinges on two factors: the study’s n (ceteris paribus, larger ns result in smaller p-values) and the study’s observed effect size (ceteris paribus, larger observed effect sizes result in smaller p-values). Consequently, the threshold for the smallest observed effect size that satisfies p < .05 decreases as n increases; in the subgroup of published studies, a negative n-ES correlation therefore emerges, which is absent in the complete set of studies (see Fig 1). Consequently, a negative n-ES correlation might indicate PSB. (Our example only considered publication being contingent on statistical significance, whereas PSB encompasses additional biases such as selective reporting of outcomes and analyses [4]. These additional biases, however, contribute to a negative n-ES correlation for similar reasons.)

Fig 1. Observed effect sizes for 100 simulated studies scatter symmetrically around the true population effect size d = 0.2.

Studies with larger n have smaller standard errors and are therefore located towards the top. Only study results within the grey area are statistically significant (p < .05). If only these studies (white) are published, a correlation between sample size and effect size emerges (here, r = -.59), which is absent for the complete set of studies.

https://doi.org/10.1371/journal.pone.0297075.g001
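For readers who prefer code to prose, the following minimal R sketch (not the simulation code behind Fig 1) illustrates this selection mechanism: the significant studies show a clearly negative n-ES correlation even though no such correlation exists in the full set.

```r
# Minimal sketch of the selection mechanism behind Fig 1: simulate two-group studies
# with a true effect of d = 0.2, keep only those with p < .05, and compare the n-ES
# correlation before and after selection.
set.seed(1)

simulate_study <- function(n_per_group, true_d = 0.2) {
  x <- rnorm(n_per_group, mean = 0)
  y <- rnorm(n_per_group, mean = true_d)
  t_res <- t.test(y, x, var.equal = TRUE)
  pooled_sd <- sqrt((var(x) + var(y)) / 2)
  data.frame(n = 2 * n_per_group,
             d = (mean(y) - mean(x)) / pooled_sd,
             p = t_res$p.value)
}

studies <- do.call(rbind, lapply(sample(20:300, 100, replace = TRUE), simulate_study))

cor(studies$n, studies$d, method = "spearman")                  # full set: close to zero
with(subset(studies, p < .05), cor(n, d, method = "spearman"))  # "published" only: clearly negative
```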

The n-ES correlation avoids the problems discussed for other PSB indicators above. This can be illustrated by an influential survey that looked at a random sample of almost 400 psychology papers from 2007 and found a strong negative n-ES correlation [18]. Hereafter, we refer to samples of this type that compile data across a wide range of topics as cross-topics samples. For cross-topics samples, statistical power is not a concern because researchers can compile as large a sample of studies as is required. Also, random sampling of studies, which guarantees the representativeness of the sample for the target population, is easy to achieve. Effect size heterogeneity, however, remains a potential problem in cross-topics samples because innocuous factors other than PSB can lead to a negative n-ES correlation. A particular concern is that researchers have some understanding of the magnitude of the effect they are studying and adjust their sample size accordingly, i.e., use larger samples to study small effects and smaller samples to study large effects. This will lead to a negative n-ES correlation even in the absence of PSB, and we shall refer to this as sample-size adjustment. Although sample size calculations are infrequent in psychology [18], this does not invalidate concerns over sample-size adjustment because researchers might have tacit knowledge about which n suffices in their field. Consequently, it is unclear to what extent the strong negative n-ES correlation found by [18] represents PSB or sample-size adjustment. In Study 1, we sought to explore this issue. We did so by analysing the n-ES correlation under a range of circumstances that differ in key aspects. Given the complexity of the issue, some readers might find Table 2 a helpful companion to the detailed account that follows below.

Aims

The first aim of Study 1 was to investigate the n-ES correlation under circumstances that make sample-size adjustment unlikely. We did this by computing the n-ES correlation for the studies combined within the same meta-analysis (henceforth, within meta-analyses). For example, this would be r = -.59 for a fictitious meta-analysis of the white studies in Fig 1. What drives the differences in effect size among studies that are combined in a meta-analysis typically remains unclear [15,19]. This suggests that researchers are unable to predict if their effect will turn out small, average, or large compared to other studies of the same topic. Consequently, diverse investigations of the same topic should be based on the same expectation of effect size, and this should largely eliminate sample-size adjustment within meta-analyses. (There are further reasons why a negative n-ES correlation might arise in the absence of PSB [20]. However, these reasons concern specific characteristics of medical trials that rarely apply to psychological research, which is why we do not pursue this point further.)

In order to judge if the observed (average) n-ES correlation within meta-analyses is indicative of PSB, it is important to know what to expect in the absence of PSB. Intuitively, r = .00 appears correct. However, even without PSB, negative n-ES correlations might arise (see Fig 2).

Fig 2. Fictitious results for a set of 100 multiple replications of the same study.

Reflecting the absence of PSB, the n-ES correlation is r = .00 in the left panel (whereby squares/circles depict negative/positive effect sizes). However, the n-ES correlation is typically based on unsigned effect sizes [18]. Once these are used (filled elements in right panel, whereby squares indicate results with changed sign), the n-ES correlation changes to r = -.07.

https://doi.org/10.1371/journal.pone.0297075.g002
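The artefact illustrated in Fig 2 can be reproduced with a few lines of R. This is a sketch under simplifying assumptions (e.g., a true effect of zero), not the procedure used for the figure:

```r
# With no selection at all, signed effect sizes show no n-ES correlation, but taking
# absolute values induces a small negative one, because small studies scatter further
# from zero in both directions.
set.seed(2)

n_per_group <- sample(20:300, 1000, replace = TRUE)
true_d <- 0                                                   # assume a null effect for simplicity
se_d <- sqrt(2 / n_per_group + true_d^2 / (4 * n_per_group))  # approximate standard error of d
d_obs <- rnorm(1000, mean = true_d, sd = se_d)

cor(2 * n_per_group, d_obs, method = "spearman")       # signed: close to zero
cor(2 * n_per_group, abs(d_obs), method = "spearman")  # unsigned: negative
```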

Consequently, it would be helpful to compare the n-ES correlation within meta-analyses (where PSB might be an issue) against data that are free from PSB. Many-Labs replications and Registered Replication Reports (hereafter multiple replications) present such an opportunity. Multiple replications use standardised procedures to replicate original studies across multiple sites (e.g., [21,22]). Because any set of replications addresses the same original study, sample-size adjustment cannot be an issue. Additionally, because multiple replications are pre-registered, PSB can be expected to be absent, too. We therefore determined the n-ES correlation within each set of multiple replications to obtain a PSB-free comparison standard for the n-ES correlation observed within meta-analyses. If we were to find a stronger n-ES correlation within meta-analyses than within multiple replications, this would suggest PSB within meta-analyses. Conversely, if the (average) n-ES correlation turned out to be similar within meta-analyses and within multiple replications, this would suggest the absence of PSB in meta-analyses.

The second aim of Study 1 was to explore evidence for sample-size adjustment across topics and its impact on the n-ES correlation in cross-topics samples. Plausibly, researchers use relatively small samples to investigate topics that typically produce strong effects and relatively large samples for topics that typically produce weak effects. As we discussed earlier, this could explain the negative n-ES correlation in cross-topics samples [18], even in the absence of PSB. The average effect size in a meta-analysis reflects the typical strength of effects for the topic under investigation. We therefore correlated meta-analyses’ average effect size with their average sample size. If this n-ES correlation between meta-analyses were negative, this would indicate sample-size adjustment across topics. In this case, the n-ES correlation can be expected to be stronger in cross-topics samples than within meta-analyses, because sample-size adjustment is implausible in the latter (see also Table 2). Exploring this idea was the third aim of Study 1. Finally, this study provided an opportunity to further examine the distribution of empirical effect sizes. A previous study [23] evaluated 12,170 correlation coefficients and 6,447 Cohen’s d statistics extracted from studies included in 134 published meta-analyses. In their terminology, the 25th, 50th, and 75th percentiles are labelled small, medium, and large, and they contrasted these with Cohen’s guidelines [24,25]. They found that these empirical values were considerably lower than Cohen’s guidelines (d = 0.15/0.36/0.65 instead of 0.20/0.50/0.80 for small, medium, and large effect sizes, respectively). In a sample of 150 meta-analyses, we compare our empirical estimates for small, medium, and large effect sizes to these previous findings [23].

Study 1

Methods

Samples.

Our analyses require meta-analyses that report sample sizes and effect sizes for their primary studies. We used a compilation of such meta-analyses, 50 each for cognitive psychology, organizational psychology, and social psychology [15]. From the same source, we took 57 multiple replications as a comparison standard [21,22,26–30]. Following [18], the signed effect sizes were recoded as unsigned Cohen’s d. The datasets are described in full in [15].

Data analysis

To obtain within-meta-analyses and within-multiple-replications results, we computed the n-ES correlation as Pearson’s r for each meta-analysis and for each set of multiple replications. To facilitate comparisons with [18], we also calculated Spearman’s ρ (rS). Where relevant, we used bootstrapping (10,000 bootstrap samples) for group comparisons [31]. All analyses were conducted in R 4.2.1 [32]. The data and analysis document (including additional analyses and robustness checks) can be found at https://osf.io/ce6v3/?view_only=86b6b997ca52430898a6a2bdb38cf9bb.
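As an illustration of this analysis pipeline, a minimal R sketch might look as follows. The data frame `primary` (with columns `meta_id`, `n`, and `d`) is hypothetical, and the bootstrap CI for the median is obtained by resampling the per-meta-analysis correlations.

```r
# Hypothetical sketch: per-meta-analysis n-ES correlations (Spearman) and a
# 95% bootstrap CI for their median.
nes_cor <- sapply(split(primary, primary$meta_id),
                  function(ma) cor(ma$n, ma$d, method = "spearman"))

set.seed(1)
boot_medians <- replicate(10000, median(sample(nes_cor, replace = TRUE), na.rm = TRUE))
c(median = median(nes_cor, na.rm = TRUE), quantile(boot_medians, c(.025, .975)))
```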

Results

Small, medium, large effect sizes based on replication studies and meta-analyses.

Following [23], we examined the 25th, 50th, and 75th percentiles and labelled these small, medium, and large effect sizes (Cohen’s d). For multiple replications, the values corresponding to small, medium, and large effect sizes were 0.12, 0.33, and 0.86, respectively.

For all meta-analyses, the values for small, medium, and large effect sizes were 0.18, 0.42, and 0.77. Splitting by discipline showed some minor variation. For cognitive psychology, the corresponding values were 0.24, 0.50, and 0.90. For organizational psychology, they were comparable to those of cognitive psychology: 0.22, 0.45, and 0.80. For social psychology, they were notably smaller than in the other two disciplines: 0.13, 0.33, and 0.64.
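A sketch of this computation, with `d_all` as a hypothetical vector of unsigned effect sizes and `discipline` as the matching sub-discipline label:

```r
# Percentile-based benchmarks for small, medium, and large effect sizes,
# overall and split by sub-discipline (hypothetical inputs).
quantile(d_all, probs = c(.25, .50, .75))
tapply(d_all, discipline, quantile, probs = c(.25, .50, .75))
```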

The n-ES correlation in the absence of sample-size adjustment: Within meta-analyses and multiple replications.

Addressing our first aim, we first focus on the n-ES correlation within meta-analyses and multiple replications. As discussed earlier, both should be unaffected by sample-size adjustment. Additionally, multiple replications are also unaffected by PSB (see also Table 2). Descriptive statistics are presented in Table 1.

Table 1. Descriptive statistics for the correlation between sample size and effect size within meta-analyses (MAs) and multiple replications.

95% CI based on bootstrap.

https://doi.org/10.1371/journal.pone.0297075.t001

As can be seen, r and rS produced very similar results for the n-ES correlation within meta-analyses (Table 1). Likewise, whether the average correlation was expressed as mean or median hardly affected results. In the remainder, we follow [18] and focus on rS; for consistency with subsequent analyses, we describe the average n-ES correlation via the median. Across all domains, negative n-ES correlations emerged, with averages ranging from small to small-to-medium in strength. All median n-ES correlations differed statistically significantly from zero (the confidence intervals excluded zero; see Table 1). Interestingly, we found the same n-ES correlation within meta-analyses and within multiple replications, median rS = -.16 (see also Table 2, which summarises key results). As discussed earlier, this similarity would be expected if meta-analyses are unaffected by PSB.

Table 2. Study types, their characteristics, and key results.

https://doi.org/10.1371/journal.pone.0297075.t002

As discussed earlier, the negative n-ES correlation likely arises from our reliance on unsigned effect sizes (see Fig 2). In order to test this explanation, we re-ran the n-ES correlations within multiple replications. This time, however, we used signed effect sizes within each set of multiple replications (akin to the left panel in Fig 2). In line with our explanation, the median n-ES correlation fell to rS = -.01 [-.07, .08]. The median for the signed n-ES correlation within meta-analyses was rS = -.04 [-.10, -.0003]. Although statistically significant (the 95% CI excludes zero), this correlation is very small.

Exploring sample-size adjustment: The n-ES correlation between meta-analyses

Addressing our second aim, we investigated sample-size adjustment and checked whether studies on topics that tend to produce relatively large effect sizes tend to have relatively small n. We therefore examined the correlation between meta-analyses’ average n and average effect size. Within meta-analyses, n tended to be strongly right-skewed (median skewness = 2.38); effect size also tended to be right-skewed, but to a lesser extent (median skewness = 1.05). For each meta-analysis, we therefore expressed its average effect size via the mean and its average n via both its mean and its median. We then ran two sets of analyses, one based on mean n and one based on median n. Both led to very similar results and identical conclusions. Here, we report the analyses based on median n.
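For completeness, a brief sketch of this skewness summary (same hypothetical `primary` data frame as above; the simple moment-based estimator used here may differ slightly from other skewness estimators):

```r
# Per-meta-analysis skewness of n and d, summarised as the median across meta-analyses.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3  # simple moment-based estimator
median(tapply(primary$n, primary$meta_id, skewness))
median(tapply(primary$d, primary$meta_id, skewness))
```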

The scatterplot for the relationship between meta-analyses’ average n and average effect size showed strong outliers (see Fig 3). Consequently, we focussed on rS, which resulted in a small-to-medium negative correlation (rS = -.24, p = .003). (Note that the same correlation was statistically nonsignificant when expressed as r; r = -.14, p = .082.) As discussed earlier, this pattern is indicative of sample-size adjustment across topics.

Fig 3. Relationship between meta-analyses’ average effect size (absolute Cohen’s d) and their average sample size (n, on a log scale).

https://doi.org/10.1371/journal.pone.0297075.g003
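A sketch of the corresponding computation (again using the hypothetical `primary` data frame): one average effect size and one average n per meta-analysis, then a rank correlation across meta-analyses.

```r
# Between-meta-analyses correlation of average effect size and average n,
# as plotted in Fig 3.
mean_d   <- tapply(primary$d, primary$meta_id, mean)    # average effect size per meta-analysis
median_n <- tapply(primary$n, primary$meta_id, median)  # average n per meta-analysis (median, see above)
cor.test(median_n, mean_d, method = "spearman")         # reported above as rS = -.24
```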

Comparing the n-ES correlation within meta-analyses and cross-topics.

Irrespective of PSB, sample-size adjustment (which is plausible across topics but not within meta-analyses) should lead to a stronger n-ES correlation in cross-topics samples than within meta-analyses. Our previous analysis investigated the n-ES correlation across topics, but at a high level of aggregation (meta-analyses’ average effect size and average n). This precludes a sensible comparison with our earlier results regarding the n-ES correlation within meta-analyses, which was investigated at a more granular study level. To enable such a comparison, we pooled all primary studies across our 150 meta-analyses, treated them as a single cross-topics sample, and computed a single n-ES correlation as [18] did.

The 150 meta-analyses comprised altogether 7,227 primary effect sizes and sample sizes. Both d and n were right-skewed (skewness = 4.3 and 78.2, respectively). Medians (means, SDs) were 0.42 (0.57, 0.62) for d and 100 (438, 8131) for n. The n-ES correlation across topics was rS = -.23, 95% CI [-.25, -.21], only slightly stronger than our average n-ES correlation within meta-analyses (median rS = -.16). This suggests that the effect of sample-size adjustment on the n-ES correlation in cross-topics samples is modest.

A previous study [18] found a much stronger n-ES correlation in their cross-topics sample (rS = -.45). To facilitate comparisons, we computed an estimated 95% CI [34], which was [-.53, -.36]. This differs markedly from the CI for our n-ES correlation across topics, [-.25, -.21]. Consequently, sampling error cannot easily account for the stark difference between the n-ES correlation in our cross-topics sample and in [18].
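The estimated CI can be reproduced with the approximation in [34]: a Fisher z transform of rS with its standard error inflated by (1 + rS^2/2). The sketch below assumes a sample of roughly 395 papers for [18] (the survey reports "almost 400"); with these inputs it returns approximately [-.53, -.36].

```r
# Approximate CI for a Spearman correlation via Fisher's z with the
# Bonett & Wright (2000) standard error [34]; n = 395 is an assumption.
spearman_ci <- function(rs, n, conf = .95) {
  z    <- atanh(rs)
  se   <- sqrt((1 + rs^2 / 2) / (n - 3))
  crit <- qnorm(1 - (1 - conf) / 2)
  tanh(c(lower = z - crit * se, upper = z + crit * se))
}

spearman_ci(rs = -.45, n = 395)   # roughly [-.53, -.36]
```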

Discussion

In a sample of 150 meta-analyses, we found, on average, a fairly small negative n-ES correlation (mean r = -.13). This is virtually the same as the mean n-ES correlation of r = -.16 previously observed, with the same methods, in another sample of 75 psychology meta-analyses [33]. The authors interpreted their result as evidence that meta-analyses are frequently affected by publication bias. Our results offer a different perspective. We found the same negative n-ES correlation in multiple replications, which are free from publication bias (and PSB, more generally), and we showed that the negative n-ES correlation is mostly a statistical artifact that arises from using unsigned effect sizes. Therefore, findings regarding n-ES correlations within meta-analyses offer, in our view, little evidence for PSB in psychology meta-analyses.

Previously, a much stronger n-ES correlation (rS = -.45) was observed in a cross-topics sample [18]. These authors, too, interpreted their finding as evidence for pervasive publication bias. As we argued in the introduction, their n-ES correlation might reflect (problematic) PSB, (innocuous) sample-size adjustment, or both. Our analyses found evidence for sample-size adjustment; studies that investigated stronger effects (as indicated by the overall meta-analytic effect size) tended to rely on smaller sample sizes than studies that investigated weaker effects. However, sample-size adjustment cannot fully explain the gap in the n-ES correlation within meta-analyses vs. across topics: When we combined all meta-analyses into one large cross-topics sample, our n-ES correlation (rS = -.23) remained much smaller than reported previously (see also Table 2).

Why might this be the case? The previous cross-topics sample took effect sizes from findings that directly addressed the main research question of the respective publication [18]. In contrast, meta-analyses include any pertinent result, regardless of whether it was focal or peripheral to the study it emerges from. Plausibly, PSB might be stronger for results that are focal to a study and weaker or absent for results that are peripheral. This could explain the difference between the small n-ES correlation in our cross-topics analysis of meta-analyses and the previous cross-topics sample of focal findings.

Study 2

The aim of Study 2 was therefore to compare the n-ES correlation between focal effect sizes (i.e., those that address the study’s central hypothesis or aim) and random effect sizes in a cross-topics sample.

As we explain in this section, such a comparison should take the design of the study (between- versus within-subjects) into account. In a within-subjects design, there are two ways to translate the difference between two means into a standardised effect size (e.g., [35]). This difference can be standardised with the pooled standard deviation across the two conditions. This is the same type of effect size that arises from between-subjects designs (henceforth, ESbetween). Alternatively, the difference between means can be standardised with the standard deviation of participants’ change scores; this approach typically results in a larger effect size (henceforth, ESwithin). Especially when participants’ scores correlate strongly across conditions (e.g., because the treatment effect is very homogeneous across participants), ESwithin can be much larger than ESbetween.

In surveys that investigate the n-ES correlation, effect sizes from within-subjects designs will often be of the ESwithin type because information to compute ESbetween is lacking from the primary study. (If effect sizes are taken from a meta-analysis, its authors might have chosen to compute ESwithin.) At the same time, within-subjects designs have greater statistical power than between-subjects designs, leading researchers to choose a relatively small n. Consequently, it can be expected that ESwithin, compared against ESbetween, tends to be both large and associated with small n. This would, similar to Simpson’s paradox [36], negatively bias the n-ES correlation without being indicative of PSB (see Fig 4). For this reason, it is worthwhile to take the design of the study into account.

Fig 4. Funnel plot for hypothetical results from four within-subjects studies (ws) and four between-subjects studies (bs).

Overall, a strong n-ES correlation emerges, although this correlation is zero for both types of study design.

https://doi.org/10.1371/journal.pone.0297075.g004
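The divergence between ESwithin and ESbetween can be made concrete with the standard conversion (a sketch assuming equal standard deviations in both conditions): the change-score standard deviation equals the pooled standard deviation times sqrt(2(1 - r)), where r is the correlation between conditions.

```r
# ES_within = ES_between / sqrt(2 * (1 - r)), assuming equal SDs in both conditions;
# r is the correlation between participants' scores in the two conditions.
es_within_from_between <- function(es_between, r) es_between / sqrt(2 * (1 - r))

es_within_from_between(es_between = 0.4, r = c(0, .5, .8, .9))
# 0.28, 0.40, 0.63, 0.89: the more homogeneous the effect, the larger ES_within
```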

Method

Power analysis.

We sought 90% power to identify a difference between two dependent correlations, r = -.45 and r = -.16, via a two-tailed test with α = .05. Our power analysis in G*Power [37] suggested a minimum sample size of n = 157. We decided to use a sample of 160 papers. (The Open Science Framework page for this paper contains alternative power analyses with varying assumptions. In all cases, n ~ 150 appeared sensible for correlations with dependency, https://osf.io/ce6v3/?view_only=86b6b997ca52430898a6a2bdb38cf9bb.)

Eligibility criteria and sampling of papers.

To be suitable for our study, psychology papers needed to fulfil the following eligibility criteria: present original data; use inferential statistics to address the main research question; provide sufficient information to calculate relevant effect sizes; present n. We excluded papers that focused on inferential analyses for which there is no straightforward unitary effect size (multilevel models, structural equation models, time series models, cluster analysis, social network analysis, multidimensional scaling, statistical simulation models, machine learning models, exploratory factor analysis and principal component analysis).

To sample 160 papers, we (somewhat arbitrarily) decided to draw 16 papers for each year from 2012–2021. In particular, we searched for “the” in All Fields in Web of Science, restricted by target year. To focus on psychology papers, we used Web of Science Category and retained only those categories that start with “psychology”. From the resulting list of hits, we selected a paper with the help of a random number generator. If the paper fulfilled our eligibility criteria, it was retained; otherwise, we moved down the list until a suitable paper was found. This process was repeated with new random numbers until all 16 papers for that year were retained. The same process was then repeated for all years.

Selection and coding of focal and random effect size.

For each paper, we extracted two effect sizes, one focal and one random, as well as the sample size associated with each. The focal effect size directly addressed what the paper presented as its main hypothesis or aim. If the paper presented multiple hypotheses/aims as equally important, we used the one mentioned first in its hypotheses/aims section. One author (JH) identified the focal aim/hypothesis for all papers without knowledge of their analyses and results. In cases in which the paper later proved to have no effect size information for this aim/hypothesis, we moved to its next aim/hypothesis. Where multiple outcome variables, samples, or analyses were relevant for the focal effect size, we used whichever occurred first (either in a table or in text) in the results section.

In each paper, we chose a second effect size at random. (By chance, this sometimes happened to be the same as the focal effect size.) We selected a page via a random number generator. We coded the first effect size information on that page that originated from the paper’s study. For this purpose, we read any tables line-by-line, not column-by-column. If the page did not contain relevant effect size information, we repeated the process as required.

All effect sizes were coded as unsigned d. We used various online calculators to convert descriptive statistics, effect sizes (e.g., η2, R2, r, rS, odds ratios), and various test statistics (e.g., F-value, t-value, χ2) into d. Details on extraction and conversion are provided in our pre-registration document on the OSF.

All coding of effect sizes and sample sizes was done by one author (JH). To check reliability, we selected 40 papers at random. Based on the identified focal aim/hypothesis, a second author (AL or TP) independently coded the focal effect size and its associated sample size. Again, both d and n proved strongly right skewed (see Table 3), which is why we computed rS. Correlations between first and second coder proved satisfactory, with rS = .74 for d and rS = .97 for n.

Table 3. Descriptive statistics for focal and random effect sizes and their associated sample sizes in Study 2.

https://doi.org/10.1371/journal.pone.0297075.t003

Analytical strategy

The design of the study was pre-registered, and the analyses were conducted in R 4.2.1 [38]. We preregistered the comparison of the n-ES correlation between focal and randomly selected effects based on Pearson’s r. For this, we used Zou’s method [39], which is based on the (non-)overlap of confidence intervals and allows for dependency between correlations. Analyses were performed with cocor [40]. We used 89% confidence intervals here [41]. As visual checks showed that the relevant distributions were distinctly non-normal, we deviated from the preregistration and relied on rS rather than Pearson’s r. (The OSF contains additional analyses, e.g., with 95% CIs and based on the percentage bend correlation, all leading to the same conclusion: https://osf.io/ce6v3/?view_only=86b6b997ca52430898a6a2bdb38cf9bb)
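Because the rank-based comparison is not fully specified by the preregistered Pearson procedure, the sketch below shows one way such a comparison of dependent Spearman correlations could be bootstrapped. The data frame `papers` (columns `n_focal`, `d_focal`, `n_random`, `d_random`) is hypothetical, and this is not necessarily the procedure used in the OSF analyses.

```r
# Paired bootstrap of the difference between the two dependent n-ES rank
# correlations, with an 89% percentile interval.
set.seed(1)
boot_diff <- replicate(10000, {
  idx <- sample(nrow(papers), replace = TRUE)
  cor(papers$n_focal[idx], papers$d_focal[idx], method = "spearman") -
    cor(papers$n_random[idx], papers$d_random[idx], method = "spearman")
})
quantile(boot_diff, c(.055, .945))   # 89% interval for the difference
```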

Results and discussion

Descriptive statistics are shown in Table 3. For focal findings, we found a very strong negative n-ES correlation, rS = -.55, 89% CI [-.64, -.45]. In line with our reasoning, this correlation turned out to be weaker for randomly selected effect sizes, rS = -.37, 89% CI [-.48, -.22]. However, the 89% confidence intervals overlapped, and we therefore conclude that these correlations do not offer convincing support for our hypothesis that the n-ES correlation is stronger for focal effects than for effects chosen at random. This conclusion was not altered when we performed the analyses by type of design (between- vs. within-subjects). For between-subjects designs (n = 135), we found a strong negative n-ES correlation for focal effect sizes, rS = -.40, 89% CI [-.52, -.27], and a weaker one for randomly selected effect sizes, rS = -.29, 89% CI [-.42, -.16]. For within-subjects designs (n = 25), we found a very strong negative n-ES correlation for focal effect sizes, rS = -.67, 89% CI [-.83, -.41], and a much weaker one for randomly selected effect sizes, rS = -.17, 89% CI [-.48, .17].

Our analysis of focal findings largely followed the methods in [18]. We note that, in contrast to Study 1, our result (rS = -.55) was now quite similar to theirs (rS = -.45).

Further, comparisons of results across Studies 1 and 2 are instructive. The n-ES correlations for Study 2’s focal effects [-.64, -.45] and for Study 1’s cross-topics analysis across meta-analyses [-.25, -.22] differed reliably. This confirms our conclusion from Study 1 that the n-ES correlation is much stronger for publications’ focal effects than for effects sampled from meta-analyses. In contrast, the n-ES correlations for Study 2’s randomly selected effects [-.48, -.22] and for Study 1’s cross-topics analysis across meta-analyses [-.25, -.22] failed to differ reliably. Thus, the n-ES correlation for randomly selected effects is neither clearly less worrying than that for focal effects nor clearly more worrying than that for effects in meta-analyses. In light of these inconclusive results, we struggle to understand why the n-ES correlation differs so dramatically between effects in meta-analyses and publications’ focal effects. We note that the meta-analyses in Study 1 stemmed from only three sub-disciplines whereas the samples of focal effects stemmed from all of psychology, but it remains currently unclear if this can explain the observed differences.

General discussion

The n-ES correlation holds promise to indicate how widespread a problem PSB is and, following the introduction of counter-measures, how this might change over time [14,18,33]. However, the n-ES correlation is also affected by researchers’ (unwitting or deliberate) adjustment of their sample size to the expected effect size, a perfectly reasonable behaviour. Using data from psychology, we therefore investigated in greater detail to what extent the n-ES correlation suggests the presence of PSB in psychological research.

In Study 1, we found a small negative n-ES correlation within meta-analyses, which is consistent with previous results [33]. This proved to be virtually identical to the negative n-ES correlation that we observed in multiple replications, which are free from PSB. We also showed that small negative n-ES correlations like these are plausible in the absence of PSB. Overall, we would therefore argue that the small negative n-ES correlation within psychological meta-analyses consistently observed by us and by [33] suggests the absence of noteworthy PSB (at least in the three scrutinised sub-disciplines: cognitive, organizational, and social psychology). (Similarly, our results suggest that an n-ES correlation around rS = -.23 is no reason for concern.) This is in line with previous research which suggests that evidence for PSB in psychological meta-analyses is weak and that, where PSB is present, it is likely to be mild [17]. Similarly, previous research indicates that applying adjustments for PSB to psychological meta-analyses results in minimal changes to effect size estimates [42]. Obviously, this does not mean that PSB is never a problem in meta-analyses in psychology, and research into how best to uncover it remains important (e.g., [5,16,43]).

The inconspicuous n-ES correlation for effects sampled from meta-analyses contrasts sharply with the one in cross-topics samples of focal effects (i.e., effects that take a central role in the papers they are published in): For cross-topics samples of focal effects, [18] and our Study 2 consistently found strong negative n-ES correlations. At a theoretical level, such a difference might be expected. First, across different topics researchers might (unwittingly or deliberately) adjust their sample size to the expected effect size, which induces a negative n-ES correlation. Such sample-size adjustment is less plausible within meta-analyses. Here, researchers investigate the same topic and therefore would rarely have reasons to hold different expectations about the magnitude of the expected effect [15]. Second, it is plausible that PSB should affect focal effects in particular. For example, researchers who fail to find an expected effect but find an unexpected one instead might shift the focus of their publication to the latter [44]. By definition, cross-topics samples of focal effects consist of focal effects only, which is not true for meta-analyses. A smaller proportion of effects in meta-analyses should therefore be affected by PSB, which should weaken the n-ES correlation within meta-analyses.

Our empirical evidence did not suggest that sample-size adjustment and stronger PSB in focal effects sufficiently account for the large difference in the n-ES correlation within meta-analyses versus cross-topics samples of focal effects. Although we found evidence for sample-size adjustment in Study 1, it was too weak to explain the difference in the n-ES correlation within meta-analyses and across topics. Moreover, we failed to find clear evidence in Study 2 that the n-ES correlation is less pronounced for effects selected at random than for focal effects. In sum, it remains currently unclear why much stronger n-ES correlations are found in samples of focal effects than in samples of effects in meta-analyses, and to what extent this reflects benign or problematic reasons. Although our research suggests that some negative n-ES correlations may be seen as unproblematic, it currently remains unclear how strong n-ES correlations need to be to indicate nontrivial PSB effects. More research on these topics is needed.

The n-ES correlation is one among numerous indicators developed to detect the presence of PSB (e.g., [16]). Here, we focussed on the n-ES correlation because previous surveys based on this method have fuelled concerns about widespread PSB in psychological research [18,33]. Various methods have been compared regarding their ability to uncover PSB in single meta-analyses (e.g., [16,17]). Whether the n-ES correlation and other indicators differ in their suitability to describe PSB in the kind of larger surveys that we presented here is currently unclear.

Conclusion

Negative n-ES correlations have previously been described as evidence for PSB in psychological research [18,33]. We demonstrated here that the negative n-ES correlations in meta-analyses from three psychological sub-disciplines were inconspicuous and do not point to worrying levels of PSB. However, alternative sampling strategies led to much stronger n-ES correlations, which are only partly explained by benign factors. The extent to which these heightened n-ES correlations also reflect the effects of PSB is currently unclear.

References

  1. Greenwald A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82(1), 1–20.
  2. Sterling T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30–34.
  3. Rosenthal R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638.
  4. Simmons J. P., Nelson L. D., & Simonsohn U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. pmid:22006061
  5. Stanley T., Doucouliagos H., Ioannidis J. P., & Carter E. C. (2021). Detecting publication selection bias through excess statistical significance. Research Synthesis Methods, 12, 776–795. pmid:34196473
  6. Dwan K., Gamble C., Williamson P. R., & Kirkham J. J. (2013). Systematic review of the empirical evidence of study publication bias and outcome reporting bias—an updated review. PloS one, 8(7), e66844. pmid:23861749
  7. Franco A., Malhotra N., & Simonovits G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505.
  8. Ferguson C. J., & Brannick M. T. (2012). Publication bias in psychological science: prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods, 17(1), 120–128. pmid:21787082
  9. McLeod B. D., & Weisz J. R. (2004). Using dissertations to examine potential bias in child and adolescent clinical trials. Journal of Consulting and Clinical Psychology, 72(2), 235. pmid:15065958
  10. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 943–951.
  11. Schäfer T., & Schwarz M. A. (2019). The meaningfulness of effect sizes in psychological research: Differences between sub-disciplines and the impact of potential biases. Frontiers in Psychology, 10, 813. pmid:31031679
  12. Scheel A. M., Schijen M. R., & Lakens D. (2021). An excess of positive results: Comparing the standard Psychology literature with Registered Reports. Advances in Methods and Practices in Psychological Science, 4(2), 25152459211007467.
  13. Siegel M., Eder J. S. N., Wicherts J. M., & Pietschnig J. (2022). Times are changing, bias isn’t: A meta-meta-analysis on publication bias detection practices, prevalence rates, and predictors in industrial/organizational psychology. Journal of Applied Psychology, 107(11), 2013–2039. pmid:34968082
  14. Nelson L. D., Simmons J. P., & Simonsohn U. (2018). Psychology’s renaissance. Annual Review of Psychology, 69, 511–534. pmid:29068778
  15. Linden A. H., & Hönekopp J. (2021). Heterogeneity of research results: a new perspective from which to assess and promote progress in psychological science. Perspectives on Psychological Science, 16(2), 358–376. pmid:33400613
  16. Renkewitz F., & Keiner M. (2019). How to detect publication bias in psychological research. Zeitschrift für Psychologie, 227(4), 261–279.
  17. van Aert R. C., Wicherts J. M., & van Assen M. A. (2019). Publication bias examined in meta-analyses from psychology and medicine: A meta-meta-analysis. PloS one, 14(4), e0215052. pmid:30978228
  18. Kühberger A., Fritz A., & Scherndl T. (2014). Publication bias in psychology: a diagnosis based on the correlation between effect size and sample size. PloS one, 9(9), e105825. pmid:25192357
  19. van Erp S., Verhagen J., Grasman R. P., & Wagenmakers E.-J. (2017). Estimates of between-study heterogeneity for 705 meta-analyses reported in Psychological Bulletin from 1990–2013. Journal of Open Psychology Data, 5(1).
  20. Egger M., Smith G. D., Schneider M., & Minder C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629–634. pmid:9310563
  21. Cheung I., Campbell L., LeBel E. P., Ackerman R. A., Aykutoğlu B., Bahník Š., et al. (2016). Registered Replication Report: Study 1 from Finkel, Rusbult, Kumashiro, & Hannon (2002). Perspectives on Psychological Science, 11(5), 750–764.
  22. Klein R. A., Ratliff K. A., Vianello M., Adams R. B. Jr, Bahník Š., Bernstein M. J., et al. (2014). Investigating variation in replicability. Social Psychology, 45(3), 142–152.
  23. Lovakov A., & Agadullina E. R. (2021). Empirically derived guidelines for effect size interpretation in social psychology. European Journal of Social Psychology, 51(3), 485–504.
  24. Cohen J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
  25. Cohen J. (1992). A power primer. Psychological Bulletin, 112, 155–159. pmid:19565683
  26. Ebersole C. R., Atherton O. E., Belanger A. L., Skulborstad H. M., Allen J. M., Banks J. B., et al. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82.
  27. Eerland A., Sherrill A. M., Magliano J. P., Zwaan R. A., Arnal J., Aucoin P., et al. (2016). Registered replication report: Hart & Albarracín (2011). Perspectives on Psychological Science, 11(1), 158–171.
  28. Hagger M. S., Chatzisarantis N. L., Alberts H., Anggono C. O., Batailler C., Birt A. R., et al. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11(4), 546–573. pmid:27474142
  29. Klein R. A., Vianello M., Hasselman F., Adams B. G., Adams R. B. Jr, Alper S., et al. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490.
  30. Wagenmakers E.-J., Beek T., Dijkhoff L., Gronau Q. F., Acosta A., Adams R. Jr, et al. (2016). Registered replication report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928.
  31. Efron B., & Tibshirani R. J. (1994). An introduction to the bootstrap. CRC Press.
  32. R Core Team. (2013). R: A language and environment for statistical computing.
  33. Levine T. R., Asada K. J., & Carpenter C. (2009). Sample sizes and effect sizes are negatively correlated in meta-analyses: Evidence and implications of a publication bias against nonsignificant findings. Communication Monographs, 76(3), 286–302.
  34. Bonett D. G., & Wright T. A. (2000). Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika, 65(1), 23–28.
  35. Hönekopp J., Becker B. J., & Oswald F. L. (2006). The meaning and suitability of various effect sizes for structured rater × ratee designs. Psychological Methods, 11(1), 72–86.
  36. Simpson E. H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society: Series B (Methodological), 13(2), 238–241.
  37. Faul F., Erdfelder E., Lang A.-G., & Buchner A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. pmid:17695343
  38. R Core Team. (2015). R: A language and environment for statistical computing. Retrieved from https://www.gbif.org/tool/81287/r-a-language-and-environment-for-statistical-computing
  39. Zou G. Y. (2007). Toward using confidence intervals to compare correlations. Psychological Methods, 12(4), 399. pmid:18179351
  40. Diedenhofen B., & Musch J. (2015). cocor: A comprehensive solution for the statistical comparison of correlations. PloS one, 10(4), e0121945. pmid:25835001
  41. McElreath R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. Chapman and Hall/CRC.
  42. Sladekova M., Webb L. E., & Field A. P. (2022). Estimating the change in meta-analytic effect size estimates after the application of publication bias adjustment methods. Psychological Methods, 28, 664–686. pmid:35446048
  43. Carter E. C., Schönbrodt F. D., Gervais W. M., & Hilgard J. (2019). Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144.
  44. Kerr N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217. pmid:15647155