Bias caused by sampling error in meta-analysis with small sample sizes

Background: Meta-analyses frequently include studies with small sample sizes. Researchers usually fail to account for sampling error in the reported within-study variances; they model the observed study-specific effect sizes with the within-study variances and treat these sample variances as if they were the true variances. However, this sampling error may be influential when sample sizes are small. This article illustrates that the sampling error may lead to substantial bias in meta-analysis results. Methods: We conducted extensive simulation studies to assess the bias caused by sampling error. Meta-analyses with continuous and binary outcomes were simulated with various ranges of sample size and extents of heterogeneity. We evaluated the bias and the confidence interval coverage for five commonly-used effect sizes (i.e., the mean difference, standardized mean difference, odds ratio, risk ratio, and risk difference). Results: Sampling error did not cause noticeable bias when the effect size was the mean difference, but the standardized mean difference, odds ratio, risk ratio, and risk difference suffered from this bias to different extents. The bias in the estimated overall odds ratio and risk ratio was noticeable even when each individual study had more than 50 participants under some settings. Also, Hedges' g, which is a bias-corrected estimate of the standardized mean difference within studies, might lead to larger bias than Cohen's d in meta-analysis results. Conclusions: Caution is needed when performing meta-analyses with small sample sizes. The reported within-study variances may not be simply treated as the true variances, and their sampling error should be fully considered in such meta-analyses.


Introduction
Systematic reviews and meta-analyses have become important tools to synthesize results from various studies in a wide range of areas, especially in clinical and epidemiological research [1][2][3]. Sampling error is a critical issue in meta-analyses. On the one hand, it impacts the evaluation of heterogeneity between studies. For example, the popular heterogeneity measure, the $I^2$ statistic, is supposed to quantify the proportion of variation due to heterogeneity rather than sampling error [4][5][6]; if sampling error increases, $I^2$ tends to decrease, leading to a conclusion of more homogeneous studies. More troublingly, within-study sampling error may affect the derivation underlying $I^2$ to such an extent that the interpretation of $I^2$ is challenged [7]. On the other hand, sampling error may threaten the validity of a meta-analysis. The most popular meta-analysis method models the observed effect size in each study as a normally distributed random variable and treats the observed sample variance as if it were the true variance [8,9]. It accounts for sampling error in the point estimate of the treatment effect within each study, but it ignores sampling error in the observed variance. This method is generally valid when the number of samples within each collected study is large: large-sample statistical properties, such as the central limit theorem and the delta method, guarantee that the distribution approximation performs well. However, ignoring sampling error in within-study variances has caused some misunderstandings about basic quantities in meta-analyses, especially when some studies have few samples. For example, the famous Q test for homogeneity does not exactly follow the chi-squared distribution due to such sampling error [10], and this problem may subvert $I^2$ [7].
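As a concrete reminder of how Q and $I^2$ are conventionally computed, with the estimated within-study variances treated as if they were fixed, here is a minimal sketch (the function name q_and_i2 and the toy inputs are our own, not from the article):

```python
import numpy as np

def q_and_i2(y, v):
    """Cochran's Q statistic and the I^2 heterogeneity measure.

    y : observed study-specific effect sizes.
    v : their estimated within-study variances, treated here -- as in
        conventional practice -- as if they were the true variances.
    """
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(v, dtype=float)     # inverse-variance weights
    y_bar = np.sum(w * y) / np.sum(w)        # fixed-effect pooled estimate
    q = np.sum(w * (y - y_bar) ** 2)         # Cochran's Q
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2
```

For example, `q_and_i2([0.1, 0.5, 0.9], [0.04, 0.05, 0.06])` returns a large Q and $I^2$ close to 1, whereas identical study estimates give Q = 0 and $I^2$ = 0.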
One important purpose of performing meta-analyses is to increase precision as well as to reduce bias in the conclusions of systematic reviews [11]. For this reason, the PRISMA statement [12] recommends that researchers report the risk of bias both within individual studies and between studies. The bias within individual studies often relates to the studies' quality [13]. Also, certain measures have been designed to reduce bias in study-level estimates. For example, Hedges' g is considered less biased than Cohen's d within studies when the effect size is the standardized mean difference (page 81 in Hedges and Olkin [14]). The bias between studies is usually introduced by publication bias or selective reporting [15][16][17][18][19][20]. Besides the bias in point estimates of treatment effects, sampling error also produces bias in the variance of the overall weighted mean estimate under the fixed-effect setting [21,22]. Under the random-effects setting, the well-known DerSimonian-Laird estimator of the between-study variance may also have considerable bias, especially when sample sizes are small [23,24]. Moreover, the between-study bias in the treatment effect estimates, such as publication bias, may implicate other parameters in a meta-analysis, including the between-study variance [25]. The bias in variance estimates can seriously impact the precision of meta-analysis results.
This article focuses on the performance of meta-analyses with small sample sizes, where the sampling error in the observed within-study variances may not be ignored. Throughout this article, we refer to sample size as the number of participants in an individual study, not the number of studies in a meta-analysis. Studies with small or moderate sample sizes are fairly common in meta-analyses [26], especially when the treatments are expensive and the enrollment of participants is limited by studies' budgets. We demonstrate a type of bias in meta-analysis results that is completely due to sampling error; it has received relatively little attention in the existing literature compared with other types of bias [27][28][29][30]. Such bias is mainly caused by the association between the observed study-specific effect sizes $y_i$ and their within-study variances $s_i^2$. This association may exist even in the absence of publication bias or selective reporting [31,32]. Even when one uses the true variances instead of the estimated variances, the association may still be present for certain effect sizes, e.g., the (log) odds ratio.
If each study's result is unbiased and its marginal expectation equals some overall treatment effect θ, then a naïve argument for the unbiasedness of the overall effect estimate in a meta-analysis, $\hat{y} = \sum_i w_i y_i / \sum_i w_i$, is that $E[\hat{y}] = \sum_i w_i E[y_i] / \sum_i w_i = \theta$. The weight $w_i$ is usually the inverse of the within-study variance or of the marginal variance incorporating heterogeneity between studies. However, this argument treats the weights $w_i$ as fixed values, while in practice they are estimates subject to sampling error. The association between the observed effect sizes and their estimated within-study variances may be strong when the sample sizes are small, so the expectation of the overall estimate in the meta-analysis may not be directly derived without information about such association, and its unbiasedness is largely unclear [33]. In addition, when the sample sizes are small, the sampling error in the observed within-study variances and the estimated between-study variance may be large, so the confidence interval (CI) of the overall estimate may be poor, with coverage probability much lower than the nominal level.
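The mechanism can be illustrated numerically. The following sketch (all numbers are our own illustrative assumptions, not the article's simulation settings) generates unbiased study estimates $y_i$, mimics a variance estimate that depends on $y_i$ in the same way as the SMD variance formula, and shows that the inverse-variance weighted average is pulled toward zero because larger estimates receive larger variance estimates and hence smaller weights:

```python
import numpy as np

rng = np.random.default_rng(7)
theta = 1.0        # true overall effect
v = 0.2            # true within-study variance (taken equal across studies)
m = 8              # plays the role of the within-study degrees of freedom
n_studies = 10
n_reps = 5000

pooled = np.empty(n_reps)
for r in range(n_reps):
    y = rng.normal(theta, np.sqrt(v), size=n_studies)  # unbiased study estimates
    s2 = v + y**2 / (2 * m)    # variance estimate depends on y, as for the SMD
    w = 1.0 / s2               # inverse-variance weights
    pooled[r] = np.sum(w * y) / np.sum(w)

# The Monte Carlo mean of the pooled estimate falls noticeably below
# the true value 1.0, even though every y_i is individually unbiased.
print(pooled.mean())
```

Replacing `s2` with the constant true variance `v` removes the bias, confirming that the association between $y_i$ and $s_i^2$, not the study-level estimates themselves, drives this effect.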
In the following sections, we will review five common effect sizes for continuous and binary outcomes, explain how small sample sizes may introduce bias in meta-analyses, and evaluate such bias using extensive simulation studies.

Meta-analyses with continuous outcomes
Suppose that a meta-analysis contains N studies, and each study compares a treatment group with a control group. Denote $n_{i0}$ and $n_{i1}$ as the sample sizes in the control and treatment groups in study i. The continuous outcome measures of the participants in each group are assumed to follow normal distributions. The population means of the two groups in study i are $\mu_{i0}$ and $\mu_{i1}$, and the sample means are denoted as $\bar{y}_{i0}$ and $\bar{y}_{i1}$ accordingly. The variances of the samples in the two groups are frequently assumed to be equal, denoted as $\sigma_i^2$; see, e.g., page 76 in Hedges and Olkin [14] and page 224 in Cooper et al. [34]. The $\sigma_i^2$ is estimated by the pooled sample variance $s_{ip}^2 = [(n_{i0}-1)s_{i0}^2 + (n_{i1}-1)s_{i1}^2]/(n_{i0}+n_{i1}-2)$, where $s_{i0}^2$ and $s_{i1}^2$ are the sample variances in the control and treatment groups, respectively.
If the outcome measures have a meaningful scale and all studies in the meta-analysis are reported on the same scale, the mean difference (MD) between the two groups, i.e., $\Delta_i = \mu_{i1} - \mu_{i0}$, is often used as the effect size to measure treatment effect (page 224 in Cooper et al. [34]). We can obtain an estimate of the MD from each study, denoted as $y_i = \bar{y}_{i1} - \bar{y}_{i0}$, and its estimated within-study variance is $s_i^2 = (1/n_{i0} + 1/n_{i1}) s_{ip}^2$. Traditional meta-analysis methods usually account for sampling error in the sample means $y_i$ but ignore such error in the sample variances $s_i^2$; the within-study variances have been customarily treated as the true variances, which should be $(1/n_{i0} + 1/n_{i1})\sigma_i^2$ [10]. However, accurate estimates of variances may require very large sample sizes; the sample variances $s_i^2$ may be far from their true values when sample sizes are small. In the following context, we will treat the sample variances as random variables like the sample means, instead of as the true variances.
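As a sketch, the MD and its estimated within-study variance can be computed from the raw group data as follows (the function name mean_difference is our own):

```python
import numpy as np

def mean_difference(x0, x1):
    """Mean difference y_i = mean(treatment) - mean(control) and its
    estimated within-study variance (1/n0 + 1/n1) * s_pooled^2,
    assuming equal population variances in the two groups.

    x0 : raw outcomes in the control group; x1 : in the treatment group.
    """
    x0, x1 = np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)
    n0, n1 = len(x0), len(x1)
    # Pooled sample variance with (n0 + n1 - 2) degrees of freedom
    s2_pooled = ((n0 - 1) * x0.var(ddof=1) + (n1 - 1) * x1.var(ddof=1)) / (n0 + n1 - 2)
    y = x1.mean() - x0.mean()
    var_y = (1.0 / n0 + 1.0 / n1) * s2_pooled
    return y, var_y
```

For example, `mean_difference([1, 2, 3], [2, 3, 4])` returns an MD of 1 with an estimated variance of 2/3.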
Because the outcome measures are assumed to be normal, the sample means $\bar{y}_{i0}$ and $\bar{y}_{i1}$ are independent of the sample variances $s_{i0}^2$ and $s_{i1}^2$ (see page 218 in Casella and Berger [35]). Thus, $y_i$ and $s_i^2$ are independent in each study. Given that the observed MDs $y_i$ are unbiased, such independence guarantees that the overall effect size estimate is unbiased in a fixed-effect meta-analysis (which assumes that the underlying true effect sizes $\Delta_i$ in all studies equal a common value Δ). However, in a random-effects meta-analysis, each study's weight is updated as $1/(s_i^2 + \hat{\tau}^2)$ by incorporating an estimate of the between-study variance, $\hat{\tau}^2$. The between-study variance $\tau^2$ can be estimated using many different methods [36], and its estimate depends on both $y_i$ and $s_i^2$; therefore, $y_i$ and the updated weight $1/(s_i^2 + \hat{\tau}^2)$ may be correlated to some extent. The expectation of the weighted average cannot be split in the foregoing way, so the unbiasedness of the overall MD estimate is not guaranteed in a random-effects meta-analysis.

A more commonly-used effect size for continuous outcomes is the standardized mean difference (SMD), because this unit-free measure permits different scales in the collected studies and is deemed more comparable across studies (see Normand [8] and Chapter 3 in Grissom and Kim [37]). The true SMD in study i is $\theta_i = \Delta_i / \sigma_i$. It is usually estimated as $y_i = (\bar{y}_{i1} - \bar{y}_{i0})/s_{ip}$ by plugging in the sample means and the pooled standard deviation, and is often referred to as Cohen's d (see page 66 in Cohen [38]). If we define a constant $q_i = n_{i0} n_{i1}/(n_{i0} + n_{i1})$, multiply Cohen's d by $\sqrt{q_i}$, and express it as $\sqrt{q_i}\, y_i = [\sqrt{q_i}(\bar{y}_{i1} - \bar{y}_{i0})/\sigma_i] / (s_{ip}/\sigma_i)$, then the numerator follows a normal distribution with variance 1, and the denominator is the square root of a chi-squared random variable $(n_{i0}+n_{i1}-2)s_{ip}^2/\sigma_i^2$ divided by its degrees of freedom $n_{i0}+n_{i1}-2$ [35]. Also, the numerator and denominator are independent.
Therefore, strictly speaking, Cohen's d (multiplied by the constant $\sqrt{q_i}$) follows a t-distribution, although it is approximated as a normal distribution in nearly all applications. If the true effect size is non-zero, the t-distribution is noncentral. The exact within-study variance of Cohen's d can be derived as a complicated form of gamma functions [39], but researchers usually use simpler forms to approximate it. Different approximation forms for the within-study variance of Cohen's d are given in several books on meta-analyses; see, e.g., page 80 in Hedges and Olkin [14], page 226 in Cooper et al. [34], and page 290 in Egger et al. [40]. This article approximates it as $s_i^2 = \frac{n_{i0}+n_{i1}}{n_{i0} n_{i1}} + \frac{y_i^2}{2(n_{i0}+n_{i1}-2)}$. As $s_i^2$ depends on $y_i$, they are correlated. The correlation may increase as the sample size decreases, because the coefficient of $y_i^2$ in the formula of $s_i^2$, namely $\frac{1}{2(n_{i0}+n_{i1}-2)}$, increases. Furthermore, it is well known that Cohen's d is a biased estimate of the SMD. The bias is around $\frac{3\theta_i}{4(n_{i0}+n_{i1})-9}$ (page 80 in Hedges and Olkin [14]), and it reduces toward zero as the sample sizes increase. When the sample sizes are small, a bias-corrected estimate, called Hedges' g, is usually adopted [41]. It is calculated as $g_i = J_i y_i$ with $J_i = 1 - \frac{3}{4(n_{i0}+n_{i1})-9}$, and its estimated variance is $s_i^2 = \frac{n_{i0}+n_{i1}}{n_{i0} n_{i1}} + \frac{g_i^2}{2(n_{i0}+n_{i1})}$ (page 86 in Hedges and Olkin [14]). As with Cohen's d, the observed data $y_i$ and $s_i^2$ are correlated when using Hedges' g as the effect size. Therefore, even if Hedges' g is (nearly) unbiased within each individual study, the overall SMD estimate in the meta-analysis may still be biased due to the correlation between $y_i$ and $s_i^2$.
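The SMD formulas above can be sketched as follows; the function name smd_estimates is our own, and the variance forms follow the approximations stated in this section:

```python
import numpy as np

def smd_estimates(x0, x1):
    """Cohen's d, Hedges' g, and their approximate within-study variances.

    x0 : raw outcomes in the control group; x1 : in the treatment group.
    Variance approximations and the bias-correction factor J follow the
    forms quoted in this section.
    """
    x0, x1 = np.asarray(x0, dtype=float), np.asarray(x1, dtype=float)
    n0, n1 = len(x0), len(x1)
    s2_pooled = ((n0 - 1) * x0.var(ddof=1) + (n1 - 1) * x1.var(ddof=1)) / (n0 + n1 - 2)
    d = (x1.mean() - x0.mean()) / np.sqrt(s2_pooled)        # Cohen's d
    var_d = (n0 + n1) / (n0 * n1) + d**2 / (2 * (n0 + n1 - 2))
    J = 1 - 3 / (4 * (n0 + n1) - 9)                         # bias correction
    g = J * d                                               # Hedges' g
    var_g = (n0 + n1) / (n0 * n1) + g**2 / (2 * (n0 + n1))
    return d, var_d, g, var_g
```

Note how both `var_d` and `var_g` contain the squared point estimate, which is exactly the source of the correlation between $y_i$ and $s_i^2$ discussed above.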

Meta-analyses with binary outcomes
Suppose a 2 × 2 table is available from each collected study in a meta-analysis with a binary outcome. Denote $n_{i00}$ and $n_{i01}$ as the numbers of participants without and with an event in the control group, respectively; $n_{i10}$ and $n_{i11}$ are the corresponding data cells in the treatment group. The sample sizes in the control and treatment groups are $n_{i0} = n_{i00} + n_{i01}$ and $n_{i1} = n_{i10} + n_{i11}$. Also, denote $p_{i0}$ and $p_{i1}$ as the population event rates in the two groups. The odds ratio (OR) is frequently used to measure treatment effect for a binary outcome [42]; its true value in study i is $OR_i = \frac{p_{i1}/(1-p_{i1})}{p_{i0}/(1-p_{i0})}$. Using the four data cells in the 2 × 2 table, the OR is estimated as $\widehat{OR}_i = \frac{n_{i00} n_{i11}}{n_{i01} n_{i10}}$. The ORs are usually combined on a logarithmic scale in meta-analyses, because the distribution of the estimated log OR, $y_i = \log \widehat{OR}_i$, is better approximated by a normal distribution. The within-study variance of $y_i$ is estimated as $s_i^2 = 1/n_{i00} + 1/n_{i01} + 1/n_{i10} + 1/n_{i11}$. Besides the OR, the risk ratio (RR) and risk difference (RD) are also popular effect sizes. The underlying true RR in study i is $RR_i = p_{i1}/p_{i0}$, and it is also combined on the log scale in meta-analyses like the OR. The log RR is estimated as $y_i = \log \frac{n_{i11}/n_{i1}}{n_{i01}/n_{i0}}$, and its within-study variance is estimated as $s_i^2 = 1/n_{i11} - 1/n_{i1} + 1/n_{i01} - 1/n_{i0}$. The true RD in study i is $RD_i = p_{i1} - p_{i0}$, estimated as $y_i = n_{i11}/n_{i1} - n_{i01}/n_{i0}$ with estimated variance $s_i^2 = \hat{p}_{i1}(1-\hat{p}_{i1})/n_{i1} + \hat{p}_{i0}(1-\hat{p}_{i0})/n_{i0}$, where $\hat{p}_{i0}$ and $\hat{p}_{i1}$ are the sample event rates. When the sample sizes are small, some data cells may be zero even if the event is not rare. If a 2 × 2 table contains zero cells, a fixed value of 0.5 is often added to each data cell to reduce bias and avoid computational error (see page 521 in the Cochrane Handbook for Systematic Reviews of Interventions [43] and many other papers [44][45][46]), although this continuity correction may not be optimal in some cases [47][48][49][50]. Like the SMD for continuous outcomes, the distributions of the sample log OR, log RR, and RD are approximated as normal distributions in conventional meta-analysis methods. Also, because both $y_i$ and $s_i^2$ depend on the four cells of the 2 × 2 table for all three effect sizes, they are intrinsically correlated.
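A sketch of the three binary effect sizes and their estimated variances from a single 2 × 2 table, with the 0.5 continuity correction applied when any cell is zero (the function name binary_effects is our own):

```python
import numpy as np

def binary_effects(n00, n01, n10, n11):
    """Log OR, log RR, and RD with their estimated variances from a
    2x2 table (n00/n01: control without/with event; n10/n11: treatment
    without/with event). Adds 0.5 to every cell if any cell is zero,
    the common continuity correction."""
    if 0 in (n00, n01, n10, n11):
        n00, n01, n10, n11 = (c + 0.5 for c in (n00, n01, n10, n11))
    n0, n1 = n00 + n01, n10 + n11           # group sample sizes
    p0, p1 = n01 / n0, n11 / n1             # sample event rates
    log_or = np.log(n00 * n11 / (n01 * n10))
    var_log_or = 1/n00 + 1/n01 + 1/n10 + 1/n11
    log_rr = np.log(p1 / p0)
    var_log_rr = 1/n11 - 1/n1 + 1/n01 - 1/n0
    rd = p1 - p0
    var_rd = p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0
    return (log_or, var_log_or), (log_rr, var_log_rr), (rd, var_rd)
```

Because the point estimates and the variance estimates are all built from the same four cells, perturbing any cell moves both together, which is the intrinsic correlation noted above.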

Simulation studies
We conducted simulation studies to investigate the impact of sampling error on meta-analyses with small sample sizes. The number of studies in a simulated meta-analysis was set to N = 5, 10, 20, and 50. We first generated the sample size within each study, $n_i$, from a uniform distribution U(5, 10); we then gradually increased it by sampling from U(10, 20), U(20, 30), U(30, 50), U(50, 100), U(100, 500), and U(500, 1000). These sample sizes $n_i$ were generated anew for each simulated meta-analysis. The control/treatment allocation ratio was set to 1:1 in all studies, which is common in real-world applications. Specifically, $n_{i0} = \lceil n_i/2 \rceil$ participants were assigned to the control group and $n_{i1} = n_i - \lceil n_i/2 \rceil$ participants were assigned to the treatment group, where $\lceil x \rceil$ denotes the smallest integer greater than or equal to x.
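The sample-size generation and 1:1 allocation described above can be sketched as follows (assuming the uniform draws are taken over the integers in each range, which the article does not state explicitly; the function name draw_sample_sizes is our own):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_sample_sizes(n_studies, low, high):
    """Per-study total sample sizes drawn uniformly from [low, high],
    then split ceil(n/2) control vs. the remainder treatment,
    mirroring the 1:1 allocation in the simulations."""
    n = rng.integers(low, high + 1, size=n_studies)  # integer draws assumed
    n0 = np.ceil(n / 2).astype(int)                  # control group sizes
    n1 = n - n0                                      # treatment group sizes
    return n0, n1
```

For odd $n_i$ this puts the extra participant in the control group, matching $n_{i0} = \lceil n_i/2 \rceil$.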
When the outcome was continuous, we simulated meta-analyses based on the MD and the SMD. For the MD, each participant's outcome measure was sampled from $N(\mu_{i0}, \sigma_i^2)$ in the control group or $N(\mu_{i0} + \Delta_i, \sigma_i^2)$ in the treatment group. Without loss of generality, the baseline effect $\mu_{i0}$ of study i was generated from N(0, 1). The study-specific standard deviation $\sigma_i$ was sampled from U(1, 5), and it was generated anew for each simulated meta-analysis. The mean difference $\Delta_i$ was sampled from $N(\Delta, \tau^2)$. The overall MD Δ was set to 0, 0.5, 1, 2, and 5, and the between-study standard deviation τ was set to 0, 0.5, and 1. For the SMD, each participant's outcome measure was also generated using the foregoing setting within each study, but the SMD $\theta_i = \Delta_i/\sigma_i$, not the mean difference $\Delta_i$, was sampled from the normal distribution: $\theta_i \sim N(\theta, \tau^2)$. The overall SMD θ was set to 0, 0.2, 0.5, 0.8, and 1 to represent different magnitudes of effect size. The between-study standard deviation τ was 0, 0.2, and 0.5. Both Cohen's d and Hedges' g were used to estimate the SMD.
When the outcome was binary, we first simulated meta-analyses based on the OR. The event numbers $n_{i01}$ and $n_{i11}$ in the control and treatment groups were sampled from Binomial($n_{i0}$, $p_{i0}$) and Binomial($n_{i1}$, $p_{i1}$), respectively. The event rate in the control group, $p_{i0}$, was sampled from U(0.3, 0.7), representing a fairly common event [32], and it was generated anew for each meta-analysis. The event rate in the treatment group, $p_{i1}$, was calculated from $p_{i0}$ and the study-specific log OR $\theta_i$; specifically, $p_{i1} = [1 + e^{-\theta_i}(1 - p_{i0})/p_{i0}]^{-1}$. The study-specific log OR $\theta_i$ was sampled from $N(\theta, \tau^2)$, where the overall log OR θ was set to 0, 0.2, 0.4, 1, and 1.5, and the between-study standard deviation τ was 0, 0.2, and 0.5. In addition to the OR, we also generated meta-analyses based on the RR and RD. The event numbers were similarly sampled from binomial distributions, and $p_{i0}$ was drawn from U(0.3, 0.7). However, for the log RR and the RD, we considered only the fixed-effect setting with all study-specific effect sizes $\theta_i$ equal to a common value θ. Specifically, if the effect size was the log RR, the event rate in the treatment group was $p_{i1} = e^{\theta} p_{i0}$, where the true log RR θ was set to 0 and 0.3 to guarantee that $p_{i1}$ was between 0 and 1. If the effect size was the RD, $p_{i1} = p_{i0} + \theta$, where the true RD θ was set to 0 and 0.2 to guarantee that $p_{i1}$ was between 0 and 1. The random-effects setting was not considered for the log RR and RD because it may lead to improper $p_{i1}$'s beyond the [0, 1] range. We could have generated meta-analyses by truncating such improper $p_{i1}$'s and constraining them to be between 0 and 1; however, this constraint would produce bias that cannot be distinguished from the bias caused by sampling error, which is of primary interest in this article. For each simulation setting above, 10,000 meta-analyses were generated.
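The OR-based data-generating process above can be sketched as follows (the function name simulate_or_meta is our own, and the integer uniform draws for $n_i$ are an assumption):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_or_meta(n_studies, theta, tau, n_low, n_high):
    """One simulated meta-analysis under the OR setting: study-specific
    log ORs from N(theta, tau^2), control event rates from U(0.3, 0.7),
    and treatment rates solved from the study's log OR.

    Returns a list of 2x2 tables as (n00, n01, n10, n11)."""
    tables = []
    for _ in range(n_studies):
        n = rng.integers(n_low, n_high + 1)          # total sample size
        n0 = int(np.ceil(n / 2)); n1 = int(n) - n0   # 1:1 allocation
        p0 = rng.uniform(0.3, 0.7)                   # control event rate
        theta_i = rng.normal(theta, tau)             # study-specific log OR
        # Invert the log OR: p1 = [1 + exp(-theta_i) (1 - p0)/p0]^{-1}
        p1 = 1.0 / (1.0 + np.exp(-theta_i) * (1 - p0) / p0)
        e0 = rng.binomial(n0, p0)                    # control events
        e1 = rng.binomial(n1, p1)                    # treatment events
        tables.append((n0 - e0, e0, n1 - e1, e1))
    return tables
```

Each returned table can then be converted to a log OR and variance (with the 0.5 correction for zero cells) and fed into the random-effects model.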
The random-effects model was applied to each simulated meta-analysis [51], even if some meta-analyses were generated under the fixed-effect setting with τ = 0. Thus, the produced CIs might be conservative. Also, the between-study variance was estimated using the popular method of moments by DerSimonian and Laird [24]. The restricted maximum likelihood method may be a better choice [8,52,53], but it is more computationally difficult and its solution did not converge in a noticeable number of our simulated meta-analyses. There are also many other alternatives for estimating the between-study variance, such as the Paule-Mandel estimator, which may be recommended in certain situations [54,55], but they have been used less frequently than the DerSimonian-Laird estimator so far. Therefore, we considered only the DerSimonian-Laird estimator for the between-study variance, which was sufficient to achieve this article's purpose.
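A minimal sketch of the DerSimonian-Laird procedure applied to each simulated meta-analysis (the function name dersimonian_laird is our own):

```python
import numpy as np

def dersimonian_laird(y, v):
    """Random-effects pooled estimate with the DerSimonian-Laird
    moment estimator of the between-study variance tau^2.

    y : study-specific effect size estimates.
    v : their estimated within-study variances.
    """
    y, v = np.asarray(y, dtype=float), np.asarray(v, dtype=float)
    w = 1.0 / v
    y_fe = np.sum(w * y) / np.sum(w)           # fixed-effect estimate
    q = np.sum(w * (y - y_fe) ** 2)            # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)              # moment estimator, truncated at 0
    w_re = 1.0 / (v + tau2)                    # updated random-effects weights
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))           # standard error of pooled estimate
    return y_re, tau2, se
```

A Wald-type 95% CI is then `y_re ± 1.96 * se`, which is the interval whose coverage is evaluated in the simulations.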
S2-S7 Files present the R code and results for the simulation studies.

Fig 2. Boxplots of the estimated overall standardized mean differences in 10,000 simulated meta-analyses. For each sample size range on the horizontal axis, the left gray box was obtained using Cohen's d, and the right black box was obtained using Hedges' g. The true between-study standard deviation τ increased from 0 (upper and middle panels) to 0.5 (lower panels). The number of studies in each meta-analysis N increased from 5 (upper panels) to 50 (middle and lower panels). The true standardized mean difference θ (horizontal dotted line) increased from 0 (left panels) to 1 (right panels).

Fig 4. Boxplots of the estimated overall log risk ratios in 10,000 simulated meta-analyses. The true between-study standard deviation τ was 0 (i.e., the simulated studies were homogeneous). The number of studies in each meta-analysis N increased from 5 (upper panels) to 50 (lower panels). The true log risk ratio θ (horizontal dotted line) increased from 0 (left panels) to 0.3 (right panels).

Fig 5.
Boxplots of the estimated risk differences in 10,000 simulated meta-analyses. The true between-study standard deviation τ was 0 (i.e., the simulated studies were homogeneous). The number of studies in each meta-analysis N increased from 5 (upper panels) to 50 (lower panels). The true risk difference θ (horizontal dotted line) increased from 0 (left panels) to 0.2 (right panels).

Figs 1-5 present the boxplots of the estimated overall MD, SMD, log OR, log RR, and RD, respectively. In addition, Table 1 shows the bias of the estimates and Table 2 shows their 95% CIs' coverage probabilities. When the number of studies in a meta-analysis increased from 5 to 50, the range of the estimated overall effect sizes shrank because their variances decreased. When the between-study heterogeneity increased in Figs 1-3, the middle and lower panels indicate that the boxes of the estimated overall effect sizes expanded vertically due to greater heterogeneity in the meta-analyses. Fig 1 and Table 1 indicate that the estimated MD was almost unbiased in all situations with different numbers of studies and different extents of heterogeneity, even when the studies had very small sample sizes. As the trends in the plots for Δ = 0.5, 1, 2, and 5 were fairly similar to those for Δ = 0, they are not displayed in Fig 1 due to space limits. Table 2 shows that the CI coverage probability of the MD was fairly close to the nominal 95% confidence level in most cases. The coverage was slightly below 95% when the number of studies was small (N = 5) and the sample sizes within studies were also very small (between 5 and 10).
When the true SMD was zero in the left panels of Fig 2, both Cohen's d and Hedges' g were almost unbiased. The box of Cohen's d was slightly larger vertically than that of Hedges' g when the sample sizes within studies were small, so the point estimates of Hedges' g were more concentrated around the true SMD. The CI coverage was also close to the nominal 95% level. However, as the true SMD increased from 0 to 0.5 and to 1, both Cohen's d and Hedges' g began to show bias, and the bias increased as the sample sizes within studies decreased. Cohen's d generally produced less bias in the estimated overall SMD than Hedges' g, as shown in Table 1. The CI coverage of Cohen's d remained close to 95% as the true SMD increased, but that of Hedges' g dropped below 80% when the sample sizes were fairly small (between 5 and 10), the true SMD was fairly large (θ = 1), and the number of studies was large (N = 50).
The patterns in Fig 3 of the ORs for binary outcomes were similar to those in Fig 2. The estimated overall log ORs were almost unbiased when the true log OR was zero. As the true log OR increased to 1 and to 1.5 and the sample sizes within studies decreased, the bias in the estimated overall log OR tended to be larger in the negative direction. Also, Table 2 shows that the CI coverage dropped dramatically when the number of studies and the between-study variance were large. For example, when τ = 0, N = 5, θ = 1.5, and the sample size of each study was between 5 and 10, the bias of the estimated overall log OR was −0.30 and the CI coverage was 97.8%. The log OR underestimated the true value θ. Among the simulated meta-analyses whose CIs did not cover θ, 2.2% had CIs entirely below θ, while only one meta-analysis (0.01%) had a CI entirely above θ. As the number of studies increased to N = 50, with the other parameters unchanged, the bias was still −0.30, but the CI coverage decreased to 82.5%. The CIs of the meta-analyses not covering θ were all below θ. Therefore, the low CI coverage likely arose because the CI became shorter as the number of studies N increased while the bias remained.
Compared with the log OR, the log RR in Fig 4 was more sensitive to the sample sizes within studies. The estimated overall log RR had tiny bias and its CI coverage was close to 95% when the sample sizes within studies were large (more than 500). However, the bias was substantial and the CI coverage was fairly low even when the sample sizes were moderate (between 50 and 100). As with the log OR, the poor CI coverage for the log RR related to the bias. For example, when τ = 0, N = 5, θ = 0.3, and the sample size of each study was between 5 and 10, the bias of the estimated overall log RR was 0.26 and the CI coverage was 86.2%. The log RR overestimated the true value θ. The CIs of the simulated meta-analyses not covering θ were all above θ. When N increased to 50, with the other parameters unchanged, the bias was 0.35 and the CI coverage dropped dramatically to 9.7%. The CIs of the simulated meta-analyses not covering θ were again all above θ.
Fig 5 shows that the estimated overall RD was almost unbiased when the true RD was zero and had small bias when the true RD was 0.2. The bias was relatively large when the sample sizes within studies were fairly small. The CI coverages were between 92% and 96% in all situations.
In addition, Figures A-F in S1 File present scatter plots of the sample effect sizes against their precisions (i.e., the inverse of their sample variances) in ten selected simulated meta-analyses with small sample sizes for the MD, SMD (including both Cohen's d and Hedges' g), log OR, log RR, and RD. They are plotted using the same idea as the funnel plot for assessing publication bias [56], and they roughly illustrate the association between the sample effect sizes $y_i$ and their within-study variances $s_i^2$. Figure A in S1 File indicates that this association seemed tiny for the MD, which is consistent with our conclusion that the MD $y_i$ and its variance $s_i^2$ are independent in theory. The other figures show different extents of association for the SMD, log OR, log RR, and RD. For example, the estimated SMDs that were closer to zero tended to have larger precisions (i.e., smaller variances) in Figures B and C in S1 File.

Discussion
This article has shown that the bias in the overall estimates of the SMD, log OR, log RR, and RD may be substantial in meta-analyses with small sample sizes. The estimated overall MD was almost unbiased in nearly all simulation settings, mainly because its point estimate and within-study variance are independent. However, for the other four effect sizes, the intrinsic association between their point estimates and estimated variances within studies may be strong, so the meta-analysis results were biased in many simulation settings. Therefore, when the collected studies have small sample sizes, researchers need to choose a proper effect size and perform the meta-analysis with great caution.
Surprisingly, to estimate the overall SMD, using Cohen's d led to noticeably less bias than using Hedges' g in our simulation studies, although Hedges' g was designed as a bias-corrected estimate of the SMD within individual studies. For example, in one of our simulated fixed-effect meta-analyses with 50 studies and 5 to 10 samples in each study (the true SMD was 1), the average of Cohen's d in the 50 studies was around 1.29, while the average of Hedges' g, 1.07, was closer to the true value 1. This was consistent with the fact that Hedges' g is generally less biased within individual studies. However, the meta-analytic overall Cohen's d was 0.98, much closer to 1 than the meta-analytic overall Hedges' g of 0.89, because the sampling error in these effect sizes' variances induced the association between the effect sizes and the variances. Note that, instead of advocating that Cohen's d is always preferred over Hedges' g in meta-analyses, this article only reminds researchers that Cohen's d may be less biased in at least some meta-analytic results, and the argument for the use of Hedges' g in the presence of small sample sizes needs to be carefully examined.
In addition, there are alternative methods to estimate the within-study variance of Hedges' g besides the one used in our article. Specifically, our simulation studies used the variance form given earlier, $s_i^2 = \frac{n_{i0}+n_{i1}}{n_{i0} n_{i1}} + \frac{g_i^2}{2(n_{i0}+n_{i1})}$ (page 86 in Hedges and Olkin [14]); an alternative multiplies the within-study variance of Cohen's d by the squared bias-correction coefficient $J_i^2$ (page 226 in Cooper et al. [34]). Using this alternative calculation for the within-study variances of Hedges' g, the combined SMD may remain biased. For example, consider a special case in which all N studies in a meta-analysis have the same sample size n, so the bias-correction coefficients in all studies are equal: $J_i = J$. Using the fixed-effect model, the expectation of the combined Cohen's d is $\mu_d = E[\sum_i w_i y_i / \sum_i w_i]$, and the expectation of the combined Hedges' g is $\mu_g = J \mu_d$, because the common factor $J^2$ cancels in the weights. Because J is a coefficient always less than 1, we have $\mu_g < \mu_d$ if $\mu_d$ is positive. If the true overall SMD θ is also positive and the combined Cohen's d underestimates it (as in our simulation studies), then $\mu_g < \mu_d < \theta$, indicating that the combined Hedges' g is more biased. However, if the combined Cohen's d overestimates the overall SMD (i.e., $\mu_d > \theta$), then the combined Hedges' g might be less biased.

This article also helps explain the phenomenon of inflated type I error rates in tests for publication bias. To detect potential publication bias in meta-analyses, it has been popular to check for the association between the study-specific effect sizes and their standard errors using the funnel plot or Egger's regression test [15]. However, it is well known that such association may be intrinsic for binary outcomes even if no publication bias is present, so Egger's test may have an inflated type I error rate [31,32]. In addition to the intrinsic association for binary outcomes, this article indicates that such a problem also exists when using the SMD for continuous outcomes. Although the meta-analyses with false positive results do not truly have publication bias, they may still suffer from bias due to sampling error.
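The cancellation of the weights claimed above is easy to verify numerically. In this sketch (all numbers are illustrative assumptions), the variance of Hedges' g is taken as $J^2$ times that of Cohen's d, so the pooled g equals exactly J times the pooled d:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8                              # common per-study total sample size (hypothetical)
J = 1 - 3 / (4 * n - 9)            # common bias-correction factor, J < 1

d = rng.normal(1.0, 0.3, size=20)          # study-level Cohen's d values
var_d = 0.5 + d**2 / (2 * (n - 2))         # approximate variance of d (n0 = n1 = 4)
g, var_g = J * d, J**2 * var_d             # Hedges' g under the J^2-scaled variance

pooled_d = np.sum(d / var_d) / np.sum(1 / var_d)   # inverse-variance pooling
pooled_g = np.sum(g / var_g) / np.sum(1 / var_g)

# The J^2 factor cancels in the ratio of sums, so pooled_g = J * pooled_d:
assert np.isclose(pooled_g, J * pooled_d)
```

Since J < 1, the pooled g sits strictly below the pooled d whenever the latter is positive, which is the mechanism behind $\mu_g = J\mu_d$ in the argument above.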
Moreover, our findings imply that the magnitude of sample size may not be viewed as an absolute concept in meta-analyses; we may not determine whether a sample size is small or large without taking other parameters into account. For example, using the log OR as the effect size, Fig 3(A), 3(D) and 3(G) show that a sample size of 10 to 20 may be large enough to produce desirable meta-analysis results when the true log OR is zero. However, when the heterogeneity, the number of studies, and the true log OR are large, Fig 3(I) shows that a sample size of 50 to 100 may not be adequate.
The bias of the estimated overall log RR was particularly substantial in Fig 4; this may be related to the effect of the weighting bias for binary outcomes [57]. However, unlike Tang [57], this article focused on the bias completely due to sampling error, which exists for both continuous and binary outcomes.
This article performed the simulated meta-analyses using the popular inverse-of-variance method in a frequentist way. Alternatively, several exact models have been proposed for binary outcomes; they do not require the normal approximation to estimate the study-specific effect sizes and their within-study variances [58][59][60][61][62]. The event numbers in the compared groups can be directly modeled as binomial distributions, thus accounting for sampling error in both point estimates of effect sizes and their variances. Similar exact models are also needed for continuous outcomes to avoid treating the within-study variances as if they were the true variances; we leave them as future work.
Supporting information S1 File. Scatter plots of the sample effect sizes against their precisions (i.e., the inverse of their sample variances) in some simulated meta-analyses with small sample sizes. (PDF)