Meta-analyses of positive psychology interventions: The effects are much smaller than previously reported

For at least four decades, researchers have studied the effectiveness of interventions designed to increase well-being. These interventions have become known as positive psychology interventions (PPIs). Two highly cited meta-analyses examined the effectiveness of PPIs on well-being and depression: Sin and Lyubomirsky (2009) and Bolier et al. (2013). Sin and Lyubomirsky reported larger effects of PPIs on well-being (r = .29) and depression (r = .31) than Bolier et al. reported for subjective well-being (r = .17), psychological well-being (r = .10), and depression (r = .11). A detailed examination of the two meta-analyses reveals that the authors employed different approaches, used different inclusion and exclusion criteria, analyzed different sets of studies, described their methods in insufficient detail to permit clear comparison, and did not report or properly account for significant small sample size bias. The first objective of the current study was to reanalyze the studies selected in each of the published meta-analyses while taking into account small sample size bias. The second objective was to replicate each meta-analysis by extracting relevant effect sizes directly from the primary studies included in the meta-analyses. The present study revealed three key findings: (1) many of the primary studies used a small sample size; (2) small sample size bias was pronounced in many of the analyses; and (3) when small sample size bias was taken into account, the effect of PPIs on well-being was small but significant (approximately r = .10), whereas the effect of PPIs on depression was variable, dependent on outliers, and generally not statistically significant. Future PPI research needs to focus on increasing sample sizes, and future meta-analyses of this research need to assess cumulative effects from a comprehensive collection of primary studies while remaining mindful of issues such as small sample size bias.


Introduction
Mental health has often been conceptualized as the absence of negative symptomatology [1]. Traditionally, research and intervention efforts in psychology have reflected this conceptualization by focusing primarily on deficits, disease, and dysfunction. Although this focus has been invaluable to psychology, the expanding field of positive psychology offers a complementary approach by focusing on understanding and increasing well-being, defined by Ryan and Deci [2] as "optimal psychological functioning and experience" (p. 1), and the components of well-being, including strengths, life satisfaction, happiness, and positive behaviours [3,4]. Together, the traditional approach to psychology and positive psychology provide a well-balanced understanding of humanity [4] that is consistent with the World Health Organization's view that "Health is a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity" [5].
Seligman [6] identified five essential factors of well-being: Positive emotions, Engagement, Relationships, Meaning, and Accomplishment (PERMA). More specifically, well-being is made up of two similar yet distinct components: subjective well-being and psychological well-being. Subjective well-being (SWB), also referred to as the hedonic perspective of well-being, is the emotional and cognitive interpretation of the quality of one's life, and is often assessed by examining one's happiness, affect, and satisfaction with life [7,8]. Psychological well-being (PWB), also referred to as the eudaimonic perspective of well-being, includes positive relations, personal maturity, growth, and independence [9]. PWB reflects a broader, more multidimensional construct than SWB. Ryff [9] developed a model of PWB with six dimensions: (1) Self-acceptance (viewing oneself positively); (2) Positive relations with others (the ability to be empathetic and connect with others in more than superficial ways); (3) Autonomy (self-motivation and independence); (4) Environmental mastery (the ability and maturity to control and choose environments that are most appropriate); (5) Purpose in life (a sense of belonging, significance, and chosen direction); and (6) Personal growth (continuously seeking growth and optimal functioning). The two components of well-being have led researchers to different hypotheses and interests, continually providing both similar and dissimilar findings [2,10]. In sum, well-being is a broad, multidimensional construct that includes one's affect, satisfaction with life, happiness, engagement with others, personal growth, and meaning and functioning in life.
Thus, although decreasing or eliminating negative symptomatology is necessary, it is not sufficient to achieve overall well-being. Health-care practitioners and researchers must also focus on prevention and intervention strategies that create, build upon, and foster well-being. Positive psychology interventions (PPIs) should be used to supplement approaches that address poor health. Rather than focusing directly on decreasing negative symptomatology, PPIs aim to increase positive affect, meaning in life, and engagement [1]. For healthy populations, the aim is to bring clients from a 'languishing' state of being to a 'flourishing' state of being [11]. For subclinical and clinical populations, the goals are to significantly reduce negative symptomatology and increase well-being [12]. PPIs are typically easy to follow, self-administered, and brief.
Fordyce [13] developed the first documented PPI designed to increase happiness. This PPI comprised 14 techniques, including spending more time with others, enhancing close relationships, thinking positively, admiring and appreciating happiness, and refraining from worrying. More recent and common interventions developed and tested by Seligman, Steen, Park, and Peterson [4] include: (1) Gratitude visits/letters-where participants write and deliver a letter of gratitude to someone who has been particularly kind or helpful in the past, but who was never suitably thanked; (2) Three good things-each night for one week participants write down three good things that went well each day and identify the reasons these things went well; (3) You at your best-participants write a story of when they were at their best, identify their personal strengths that were utilized in the story, and then read this story and review their personal strengths each day for one week; and (4) Using signature strengths-participants complete and receive feedback from the character strengths inventory [14], and then use one of their top five character strengths in a different way each day for one week. There are many other similar interventions, such as loving kindness meditation [15], acts of kindness [16], hope therapy [17], optimism exercises [18], mindfulness-based strength practices [19], well-being therapy [20,21], and positive psychotherapy [1].
Sin and Lyubomirsky [22] published the first meta-analysis of the effectiveness of PPIs. In the ten years since its publication, this meta-analysis has been cited nearly 2,000 times, highlighting the interest in the effectiveness of PPIs. Sin and Lyubomirsky reported that PPIs had a moderate effect on improving well-being and decreasing depression. For well-being, the meta-analysis revealed a significant effect size of r = .29 (equivalent to d = .61) based on 49 studies. For decreasing depressive symptomatology, a significant effect size of r = .31 (equivalent to d = .65) was found based on 25 studies. Four years later, Bolier, Haverman, Westerhof, Riper, Smit, and Bohlmeijer [23] published a second highly cited meta-analysis of the effectiveness of PPIs, focusing only on randomized controlled studies. Bolier et al. reported much smaller effects than Sin and Lyubomirsky. Bolier et al.'s meta-analysis revealed a significant effect size of r = .17 (d = .34) for subjective well-being, r = .10 (d = .20) for psychological well-being, and r = .11 (d = .23) for depression. Moreover, after they removed outlier effect sizes, the effect sizes decreased to r = .13 (d = .26) for subjective well-being, r = .08 (d = .17) for psychological well-being, and r = .09 (d = .18) for depression. Notwithstanding the dissimilar effect size findings, the high citation rates of these two meta-analyses highlight the recent and widespread interest in positive psychology.
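The r and d values quoted above are related by the standard conversion for two groups of equal size; small discrepancies (e.g., r = .11 versus d = .23) reflect rounding or unequal group sizes in the original reports. A minimal sketch of the conversion:

```python
import math

def r_to_d(r: float) -> float:
    """Convert a correlation effect size r to Cohen's d (equal-n assumption)."""
    return 2 * r / math.sqrt(1 - r ** 2)

def d_to_r(d: float) -> float:
    """Convert Cohen's d back to r (inverse of r_to_d under the same assumption)."""
    return d / math.sqrt(d ** 2 + 4)

# Sin and Lyubomirsky's reported pairs follow this formula:
# r = .29 converts to d of about .61, and r = .31 to about .65.
```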
Schueller, Kashdan, and Parks [24] recently criticized Bolier et al.'s [23] meta-analysis as unreasonably selective, narrow, and non-comprehensive. They cautioned against drawing any conclusions from Bolier et al.'s meta-analysis for at least the following reasons. First, Bolier et al. substantially truncated their search by excluding studies prior to 1998 ("...the start of the positive psychology movement") (p. 2). This eliminated earlier interventions, including the seminal work of Fordyce [13,25]. Second, Bolier et al. only included studies that referenced "positive psychology". Because of this inclusion criterion, numerous relevant studies (e.g., studies using the "Best Possible Self" intervention) were omitted. Third, Bolier et al. excluded interventions that utilized meditation, mindfulness, forgiveness, and life-review because reviews and meta-analyses had already been conducted for these types of interventions. However, the elimination of a particular type of intervention or blend of interventions from a meta-analysis is an obstacle to determining how effective PPIs are in general. Moreover, meta-analyses restricted to a specific type of PPI make it impossible to compare the effectiveness of the full range of PPIs. Because of the restrictive inclusion criteria, the estimated effect sizes are relevant to only the particular blend of PPIs retrieved by Bolier et al. [23]. In any case, because the scope of this meta-analysis was restricted, conclusions regarding the effectiveness of PPIs in general, and of many particular types of PPIs, are limited.
In contrast to Bolier et al., Sin and Lyubomirsky [22] did not constrain their selection of primary studies and, because of this, they identified many more relevant studies than Bolier et al., even though they published their meta-analysis four years earlier. However, it is impossible to assess how comprehensive Sin and Lyubomirsky's [22] meta-analysis was because the search for primary studies was not adequately described and, therefore, not replicable. For example, the search parameters were not sufficiently described, and the search strategy included searching whatever was available in Sin and Lyubomirsky's private libraries and gathering studies from their colleagues. The literature search described in Bolier et al. [23] was similarly not replicable. For example, although Bolier et al. [23] listed numerous terms they used in conducting their searches, they did not specify how they combined them when conducting their searches.
A critical review reveals five additional serious methodological issues that were not adequately addressed in either meta-analysis, that undermine their conclusions, and that may help explain the differences in their findings. First, Sin and Lyubomirsky [22] reported only averaged unweighted rs as effect size estimates for well-being and depression (see Table 4, p. 478, in Sin & Lyubomirsky). However, these estimates give the same weight to all studies, regardless of sample size, and are widely considered inappropriate [26].
Second, the previous meta-analyses did not describe in sufficient detail how they calculated effect sizes for each primary study. For example, Sin and Lyubomirsky [22] stated that effect sizes were "computed from Cohen's d, F, t, p, or descriptive statistics" (p. 469). Bolier et al. [23] stated that they calculated Cohen's d from the post-intervention means and standard deviations and, in some instances, "on the basis of pre-post-change score" without giving any further details. This lack of clarity is especially important because the calculation of effect sizes differs depending on study design (e.g., whether the study is a between-subject or within-subject design; [27]). Thus, effect size calculations can produce different results depending on whether the study used a repeated measures design [28]. In repeated measures designs, when effect sizes are calculated from test statistics such as Fs and ts using the usual formulae, the resulting effect sizes can be substantially inflated [29,30].
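The inflation at issue can be made concrete. A paired t statistic is scaled by the standard deviation of change scores, which shrinks as the pre-post correlation rho grows; converting such a t with a formula meant for independent groups therefore overstates d by a factor of 1/sqrt(2(1 - rho)). A hypothetical sketch of that factor (illustrative values only, not any specific study's data):

```python
import math

def d_from_paired_t_naive(t: float, n: int) -> float:
    """Naive conversion that treats a paired t like a between-subjects statistic.
    This yields d in change-score units, not raw-score units."""
    return t / math.sqrt(n)

def d_from_paired_t(t: float, n: int, rho: float) -> float:
    """Convert a paired t to d in raw-score (between-subjects comparable) units,
    using the pre-post correlation rho."""
    return (t / math.sqrt(n)) * math.sqrt(2 * (1 - rho))

def inflation_factor(rho: float) -> float:
    """How much the naive conversion inflates d for a given pre-post correlation."""
    return 1 / math.sqrt(2 * (1 - rho))

# With rho = .75, for example, the naive estimate is inflated by about 1.41.
```

Note that at rho = .50 the two conversions coincide; only above .50 does the naive formula inflate d, which is why the imputed pre-post correlation matters.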
Third, Sin and Lyubomirsky's [22] and Bolier et al.'s [23] meta-analyses included articles that were common to both studies. However, we calculated a relatively low correlation between the effect sizes extracted by Sin and Lyubomirsky [22] and Bolier et al. [23], suggesting that the effect sizes were determined differently in the two meta-analyses.
Fourth, an examination of Sin and Lyubomirsky's [22] Tables 1 and 2 indicated the presence of small sample size bias. Small sample size bias (also called small study bias) occurs when smaller studies (with less precise findings) report larger effects than larger studies (with more precise findings). Small sample size bias is frequently the result of publication bias. It is well established that journals are much more inclined to publish studies with statistically significant findings than studies reporting null effects [31]. Thus, small studies, which typically report much larger effect sizes than larger studies, are more likely to be published. In turn, small sample size bias has become a significant problem in meta-analyses, and numerous methods have been developed for identifying and estimating effect sizes in the presence of small sample size bias [27]. Although Sin and Lyubomirsky [22] noted asymmetry in a funnel plot of their data, they did not include the funnel plot in their article. Relying on the Fail-safe N, they argued that even though publication bias may be present, it is "...not large enough to render the overall results nonsignificant" (p. 477) [22]. However, the Fail-safe N method is no longer considered useful in assessing the significance of small sample bias because it considers only statistical significance rather than substantive or practical significance, and it improperly assumes that effect sizes in the unpublished studies are zero [26].
In contrast to Sin and Lyubomirsky [22], Bolier et al. [23] addressed publication bias by computing Orwin's fail-safe number and by using the Trim and Fill method [32]. Although Orwin's fail-safe number and the Trim and Fill method are preferable to the Fail-safe N method, these approaches are limited and have been superseded by more advanced methods designed to estimate an effect size in the presence of small study bias, including cumulative meta-analysis, the top 10%, and limit meta-analysis [33][34][35]. Thus, it is unclear whether a reanalysis of Sin and Lyubomirsky's [22] and Bolier et al.'s [23] data, using more appropriate methods for taking into account small sample size effects, would confirm their findings or result in smaller effect size estimates.
Fifth, both Sin and Lyubomirsky and Bolier et al. also reported a number of group moderator analyses. Sin and Lyubomirsky reported six moderator analyses on well-being and six moderator analyses on depression. Similarly, Bolier et al. reported six moderator analyses on subjective well-being, six on psychological well-being, and six on depression. Inspection of these moderator analyses shows that groups consisted of as few as two studies in Sin and Lyubomirsky (10 out of 12 moderator analyses included groups with 10 or fewer studies), and as few as one study in Bolier et al. (15 out of 16 moderator analyses included groups with 10 or fewer studies). Moreover, the number of studies in the moderator groups was widely discrepant for most of their moderator analyses. However, moderator analyses based on such a small number of studies in individual groups are not powerful enough to detect even large moderator effects [36]. Moreover, the power to detect moderator effects decreases still further when the number of studies in moderator groups is unequal [36]. Thus, in addition to the issues detailed above, the moderator analyses lacked the statistical power to make them meaningful.

Accordingly, the current study had two major objectives. The first objective was to reanalyze the reported data provided by the two meta-analyses while taking into account small sample size bias and comparing the findings to the original meta-analyses. The second objective was to replicate the two meta-analyses, starting with extracting relevant data to calculate effect sizes directly from the primary studies rather than relying on the data published in the previous meta-analyses. In conducting these meta-analyses, the data were analyzed using weighted random effects models while taking into account small sample size bias using the selected methods discussed above.

Primary studies
The primary studies selected for the two major meta-analyses, Sin and Lyubomirsky [22] and Bolier et al. [23], were included in the present study.

Relevant data extraction and coding of primary studies
The selected primary studies used a variety of research designs (e.g., pre-post, post only), included one or more relevant interventions within the same study, and included one or more relevant outcome measures. Only interventions designed to improve well-being and/or decrease depression were considered relevant. Similarly, only measures of well-being and/or depression were relevant. Studies that included more than one intervention often employed only one control condition, which was used to determine the effectiveness of each intervention. Some studies included more than one control condition, some of which were designed to decrease well-being and some to increase it. Accordingly, we coded control conditions according to their presumed effect on well-being (negative, neutral, positive) and chose the most neutral control conditions to calculate PPI effect sizes. Thus, to calculate PPI effect sizes, we extracted the following data for each study, intervention, and relevant outcome measure: research design (e.g., pre-post, post only); intervention; outcome measure; sample size of both the control and intervention groups; overall sample size; means and standard deviations of both pre and post assessments; within-condition correlations between pre and post measurements (these were rarely provided); any F, t, p, or effect size (e.g., Cohen's d) statistics reported for post-only comparisons between control and intervention conditions; mean differences between pre and post measurements and associated standard deviations; and any other relevant data that allowed for effect size calculations.

Effect size calculations
The primary studies that examined the effectiveness of interventions on well-being and/or depression symptoms used a variety of research designs, including repeated measures pre-post designs and between-subjects post-only designs. Although it is relatively straightforward to calculate effect sizes (i.e., rs or Cohen's ds) for between-subjects post-only designs using means, standard deviations, Fs, ts, or ps, it is much more challenging to calculate effect sizes for repeated measures pre-post designs [26]. Primary studies using repeated measures pre-post designs rarely report sufficient statistical detail (such as the necessary correlations between pre and post scores), and thus, it is often necessary to impute estimated pre-post correlations using data obtained from other studies. Critically, it is not appropriate to use Fs, ts, or ps to calculate effect sizes using formulae designed for between-subjects designs (i.e., formulae that do not take into account pre-post correlations). Accordingly, our initial approach was to calculate effect sizes for pre-post repeated measures designs using a formula recommended by Morris [101], specifically d_ppc2, using means, standard deviations, and, when necessary, imputed pre-post correlations. Additionally, effect sizes were calculated using only post means and standard deviations, effectively treating these repeated measures pre-post designs as between-subjects post-only designs. However, because the primary studies did not report pre-post correlations for outcome measures, it was not possible to calculate d_ppc2 without imputing such correlations from elsewhere for each study. Some primary studies used multiple outcome measures. To ensure that each study contributed only one effect size to each meta-analysis, effect sizes were first calculated for each outcome measure and then aggregated to yield a single effect size.
This was done while taking into account the correlations among the within-study outcomes using methods described by Schmidt and Hunter [102] and imputing a recommended default correlation of r = .50 between within-study outcomes [103]. The aggregation of within-study outcomes was done using the R package MAc [104].
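Morris's d_ppc2 standardizes the between-group difference in pre-post change by the pooled pretest standard deviation, with a small-sample bias correction. A sketch with hypothetical inputs (note that the pre-post correlation enters the estimator's sampling variance, which is what meta-analytic weighting needs, rather than this point estimate):

```python
import math

def d_ppc2(m_pre_t, m_post_t, sd_pre_t, n_t,
           m_pre_c, m_post_c, sd_pre_c, n_c):
    """Morris's d_ppc2: difference in pre-post change between treatment (t)
    and control (c) groups, standardized by the pooled pretest SD and
    multiplied by the small-sample bias correction c_p."""
    df = n_t + n_c - 2
    sd_pooled_pre = math.sqrt(((n_t - 1) * sd_pre_t ** 2 +
                               (n_c - 1) * sd_pre_c ** 2) / df)
    c_p = 1 - 3 / (4 * df - 1)  # bias correction factor
    change_diff = (m_post_t - m_pre_t) - (m_post_c - m_pre_c)
    return c_p * change_diff / sd_pooled_pre

# Hypothetical example: intervention group improves 2 points, control is
# unchanged, pretest SD = 2 in both groups, n = 20 per group -> d near 0.98.
```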
Similarly, some primary studies used multiple interventions. Moreover, only some of these interventions were designed within the positive psychology framework to improve well-being and/or decrease depression symptoms. Thus, effect sizes were calculated for each intervention designed within the positive psychology framework to improve well-being and/or decrease depression symptoms, and the resulting effect sizes were aggregated to yield a single effect size from each study. For example, Emmons and McCullough [105] employed three experimental conditions: (a) participants listed things they were grateful for in their life, (b) participants listed hassles they encountered that day, and (c) participants listed events that happened during the week that impacted their life. In this case, the first condition (gratitude listing) was classified as the intervention group and the last condition (event listing) as the control group. As another example, Lyubomirsky, Dickerhoof, Boehm, and Sheldon [69] used three experimental conditions: (a) participants expressed optimism, (b) participants expressed gratitude, and (c) participants listed activities from the previous week. In this case, the first two conditions (optimism and gratitude) were classified as the intervention groups and the third condition was classified as the control group. Subsequently, the effect sizes obtained for the two interventions were aggregated into a single effect size for that particular study using the methods recommended by Schmidt and Hunter [102], as described above.
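The aggregation used in the preceding paragraphs can be sketched as a simple mean of the within-study effects, with the variance of that mean accounting for the assumed r = .50 correlation between outcomes. This is a hypothetical illustration of the logic, not the MAc package's implementation:

```python
import math

def aggregate_effects(effects, variances, r=0.50):
    """Aggregate correlated within-study effect sizes into one composite.
    Returns (mean effect, variance of the mean), treating every pair of
    outcomes as correlated at r (the default imputed value of .50)."""
    m = len(effects)
    mean = sum(effects) / m
    # Var(mean) = (1/m^2) * sum over all (i, j) of cov(i, j), where
    # cov(i, j) = r * sqrt(v_i * v_j) off-diagonal and v_i on the diagonal.
    total = 0.0
    for i in range(m):
        for j in range(m):
            cov = variances[i] if i == j else r * math.sqrt(variances[i] * variances[j])
            total += cov
    return mean, total / m ** 2
```

With two outcomes of 0.3 and 0.5 (variance .04 each), this yields a composite of 0.4 with variance .03, larger than the .02 an independence assumption would give; modeling the correlation prevents correlated outcomes from being overweighted.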
Finally, some primary studies-seven in Sin and Lyubomirsky's (2009) study set and three in Bolier et al.'s (2013) study set-used multiple control or comparison groups, ranging from interventions that may have decreased well-being (e.g., asking participants to reflect on negative experiences), to neutral controls, to interventions that increased well-being. In these cases, the most neutral control was chosen when calculating effect sizes. However, in some cases the control group was not clearly identified. For example, Low et al. [106] included three groups of female patients with breast cancer, who were asked to write about one of three possible options: (a) positive thoughts about their breast cancer experience, (b) deepest thoughts and feelings about their experience with breast cancer, and (c) facts about breast cancer and treatment. The first condition (positive thoughts) was classified as the intervention, which fits within the positive psychology framework, and the last condition (facts about breast cancer and its treatment) was used as the control. Finally, for studies by Cook [38] and Buchanan and Bardi [74], the no intervention controls were chosen over other controls, and for Tkach [107], the condition in which participants described any 3 events, 3 times a day, once a week was selected over other controls.
Effect sizes for primary study outcomes were calculated from available data in the following order of preference: (1) the post intervention means and standard deviations, (2) the post intervention ANOVA F values, (3) the post intervention Cohen's ds, (4) the post intervention p values, and (5) the pre-post difference score means and standard deviations as the difference between intervention and control effect sizes.

Missing data and other irregularities
A number of primary studies included in the previous meta-analyses did not report sufficient data to calculate effect sizes. In the previous meta-analyses, the effect sizes for these studies were imputed to be zero (e.g., [46,57,68]). In the current replication analyses, such studies were excluded unless missing data could be imputed from other relevant sources. For example, if standard deviations for an outcome measure were missing in one study/experiment but were reported elsewhere (e.g., for another study/experiment within the same article), the missing standard deviations were imputed from the available ones to allow the calculation of effect sizes (e.g., Pretorius et al. [108]).
A number of primary studies only reported an overall sample size and did not report the sample size for the control and intervention groups. In such cases, the sample sizes for control and intervention groups were estimated by dividing the overall sample size by the number of control and intervention groups. Lastly, four articles-Shapira and Mongrain [96], Sergeant and Mongrain [99], Mongrain and Anselmo-Matthews [97], and Mongrain, Chin, and Shapira [109]-report on four seemingly different studies but actually report on different conditions/ interventions of the same study. Accordingly, these four articles were treated as a single study.

Statistical analyses
After all effect sizes were calculated, they were pooled to obtain a weighted effect size of PPIs using a random effects model. A random effects model was chosen because true PPI effects are unlikely to be the same and are likely to vary across the interventions, participants, and designs [33,110]. A fixed effect model meta-analysis assumes that all primary study effects estimate one common underlying true effect size. In contrast, a random effect model meta-analysis assumes that primary study effects may estimate different underlying true effect sizes (e.g., a true effect size may vary depending on participants' age and the duration of the interventions).
Heterogeneity, the variation or inconsistency found among effect sizes, is expected to be due to chance and to the array of interventions and samples used. Considerable heterogeneity indicates substantial differences between studies. To assess this, two common heterogeneity statistics were calculated: Cochran's Q [111] and I² [112]. The Q statistic follows a chi-square distribution with k − 1 degrees of freedom (where k is the number of studies); it only informs us of whether or not heterogeneity exists, does not indicate how much heterogeneity exists, and is dependent on sample size. In contrast, the I² statistic expresses the percentage of total variability among the effect sizes that reflects between-study differences, where I² = 0 means that the variability found among the estimated effect sizes is due solely to sampling error within studies [113].
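The pieces described above fit together as follows: Q is the precision-weighted sum of squared deviations from the fixed-effect mean, I² re-expresses the excess of Q over its degrees of freedom as a percentage, and the between-study variance tau² feeds the random-effects weights. A minimal DerSimonian-Laird sketch (R packages such as metafor default to other tau² estimators, so exact numbers may differ):

```python
def random_effects_meta(effects, variances):
    """DerSimonian-Laird random-effects pooling, returning the pooled
    effect alongside Q, I^2 (as a percentage), and tau^2."""
    k = len(effects)
    w = [1 / v for v in variances]            # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # DerSimonian-Laird estimate of the between-study variance tau^2
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    w_star = [1 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, q, i2, tau2
```

When the studies are perfectly homogeneous, Q, I², and tau² all collapse to zero and the random-effects estimate coincides with the fixed-effect one, which matches the interpretation of I² = 0 given above.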
Small study effects were assessed by first examining scatter plots, forest plots, and funnel plots. Several methods were then used to estimate effect sizes while taking into account small study effects. First, the Trim and Fill procedure was used [32]. Second, a cumulative meta-analysis was used to determine how much the addition of small studies would change the estimated effect size. Third, effect sizes were estimated based on the top 10% (TOP10) of the most precise studies [114]. Stanley and Doucouliagos [114] demonstrated that the TOP10, despite its simplicity, performs well in estimating effect sizes in the presence of small sample size bias. Finally, effect sizes were estimated using limit meta-analysis [115], the most sophisticated of the methods developed for estimating effect sizes in the presence of small sample size bias. The limit meta-analysis has been shown to be superior to other available methods, including the trim-and-fill and selection model methods [116]. Accordingly, we report only the limit meta-analysis results. All analyses were conducted using R [117], including the packages compute.es [118], MAc [104], meta [119], metafor [120], and metasens [121].
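Of the bias-adjustment methods listed, the TOP10 estimator is the simplest to make explicit: keep only the most precise 10% of studies (smallest sampling variance) and take their fixed-effect average. A hypothetical sketch following the Stanley and Doucouliagos idea (the limit meta-analysis itself is implemented in the metasens package and is not reproduced here):

```python
import math

def top10_estimate(effects, variances, fraction=0.10):
    """Fixed-effect average of the most precise `fraction` of studies
    (smallest sampling variance), rounding the count up so that at
    least one study is always retained."""
    k = max(1, math.ceil(len(effects) * fraction))
    most_precise = sorted(zip(variances, effects))[:k]
    numerator = sum(y / v for v, y in most_precise)
    denominator = sum(1 / v for v, _ in most_precise)
    return numerator / denominator
```

When small studies report inflated effects, this estimator discards them by construction, which is why it tracks the bias-adjusted estimates far better than the naive pooled mean.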
Following the procedure described in Cooper and Hedges [122], outliers were identified as effect sizes that were at least 1.5 times the interquartile range above the upper quartile or below the lower quartile of the distribution of effect sizes. When outliers were identified, a meta-analysis was re-run after removal of the outliers to assess the impact of outliers on the findings.
Using the method for identifying outliers described by Viechtbauer and Cheung [123] yielded similar results.
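The 1.5 times interquartile range fence rule described above is straightforward to implement; a sketch (quartile conventions differ slightly across software, so borderline cases may vary):

```python
from statistics import quantiles

def iqr_outliers(effect_sizes):
    """Flag effect sizes more than 1.5 * IQR beyond the quartiles,
    i.e., the Tukey fence rule described by Cooper and Hedges."""
    q1, _, q3 = quantiles(effect_sizes, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [es for es in effect_sizes if es < lower or es > upper]
```

For example, in a distribution of mostly small effects with one extreme value (say rs of .1 to .3 plus a 2.0), only the extreme value is flagged, and the meta-analysis is then re-run without it.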

Moderator analyses
For the reasons detailed in the introduction, we have not attempted to reanalyze and replicate the moderator analyses published in Sin and Lyubomirsky (2009) and Bolier et al. (2013). Any such moderator analyses would be uninterpretable and not meaningful due to the small number of studies as well as the discrepant number of studies in the moderator groups [36]. Moreover, other issues reviewed in the introduction-most importantly the prevalent small sample size bias and non-comprehensive search for relevant primary studies-would also render any such analyses uninterpretable.

Sin and Lyubomirsky (2009) meta-analysis
Well-being: Reanalysis of reported data. The reanalysis used data reported by Sin and Lyubomirsky [22] in their Table 1. Fig 1 shows the forest plot of effect sizes (rs) as reported by Sin and Lyubomirsky, including the total sample size for each study in the "Total" column. The forest plot indicates that small studies resulted in larger effect sizes than large studies. A random effect model estimated an effect size of r = .24 [95% CI = (0.18, 0.30)] with substantial heterogeneity as measured by I² = 71.9%. Fig 2, top panel, shows a scatter plot of effect sizes and study sizes. The scatter plot indicates the presence of a small study effect. Fig 2, bottom panel, shows the funnel plot with substantial asymmetry. The regression test of the funnel plot symmetry confirmed that the plot was asymmetrical, t(47) = 4.46, p < .001. Accordingly, we estimated the effect size after accounting for the small study size bias. The limit meta-analysis (Fig 2, bottom panel) resulted in an effect size of r = .08 [95% CI = (0.00, 0.15)]. A test of small-study effects showed Q-Q'(1) = 50.83, p < .001. A test of residual heterogeneity indicated Q(47) = 120.24, p < .001. Thus, taking into account small study effects, the reanalysis resulted in a much smaller estimated effect size for well-being than the effect size (r = .29) reported by Sin and Lyubomirsky [22].
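The funnel plot regression test reported throughout these reanalyses can be sketched in its classic Egger form: regress the standardized effect y/se on precision 1/se, and read funnel asymmetry off the intercept. This minimal ordinary least squares illustration is not the weighted variant behind the exact t statistics reported here (those come from the R meta/metafor packages):

```python
def egger_regression(effects, std_errors):
    """Classic Egger test: OLS of z_i = y_i / se_i on precision 1 / se_i.
    Returns (intercept, slope); an intercept far from zero suggests
    funnel plot asymmetry, i.e., small-study effects."""
    z = [y / se for y, se in zip(effects, std_errors)]
    prec = [1 / se for se in std_errors]
    n = len(z)
    mean_x = sum(prec) / n
    mean_z = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in prec)
    sxz = sum((x - mean_x) * (zi - mean_z) for x, zi in zip(prec, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return intercept, slope
```

In an unbiased collection of studies the effects do not vary with their standard errors, so the intercept is near zero and the slope recovers the underlying effect.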
Well-being: Complete replication of meta-analysis. Table 1 reports effect sizes for PPIs on well-being determined as described above for each outcome measure and each intervention comparison. These effect sizes were then aggregated to yield a single effect size for each study comparable to those reported in Sin and Lyubomirsky [22], using the aggregation method described in the Method section. The correlation between the effect sizes reported by Sin and Lyubomirsky [22] and the effect sizes calculated through this replication was high, r = .78. As with the reanalysis of Sin and Lyubomirsky's reported data, the replication resulted in a much smaller effect size estimate than that originally reported by Sin and Lyubomirsky (r = .29).

Depression: Reanalysis of reported data. The reanalysis used data reported by Sin and Lyubomirsky [22] in their Table 2. Fig 5 shows the forest plot of effect sizes. Again, the forest plot indicates that small studies reported larger effects than large studies. A random effect model estimated an effect size of r = .25 [95% CI = (0.14, 0.34)] with substantial heterogeneity as measured by I² = 74%. Fig 6, top panel, shows the scatter plot of effect sizes and study sizes. The scatter plot indicates the presence of small study effects. Fig 6, bottom panel, shows the funnel plot with substantial asymmetry. The regression test of the funnel plot symmetry confirmed that the plot was asymmetrical, t(23) = 3.20, p = .004. Accordingly, we estimated the effect size after accounting for the small study size bias using the limit meta-analysis (Fig 6, bottom panel).

Depression: Complete replication of meta-analysis. Table 2 reports effect sizes for studies that assessed depression. The effect sizes were determined as described above for each outcome measure and each intervention comparison. These effect sizes were then aggregated to yield a single effect size for each study comparable to those reported in Sin and Lyubomirsky [22], using the aggregation method described in the Method section. The correlation between the effect sizes reported by Sin and Lyubomirsky [22] and the effect sizes calculated through this replication was high, r = .78 [95% CI = (0.52, 0.91)].

Bolier et al. (2013) meta-analysis
Subjective well-being: Reanalysis of reported data. The reanalysis used data reported by Bolier et al. [23] in their Table 2. The reanalysis estimated an effect size of r = .17, the same as the estimate reported by Bolier et al.

Subjective well-being: Complete replication of meta-analysis. Table 3 reports effect sizes determined as described above for each outcome measure and intervention comparison. These effect sizes were aggregated to yield a single effect size for each study, comparable to those reported in Bolier et al. [23]. The correlation between the effect sizes reported by Bolier et al. [23] and the effect sizes calculated through this replication was high, r = .85 [95% CI = (0.68, 0.94)].

Psychological well-being: Reanalysis of reported data. The reanalysis used data reported by Bolier et al. [23] in their Table 2. The regression test of funnel plot symmetry confirmed that the plot was asymmetrical, t(18) = 2.68, p = .02. Accordingly, it is necessary to estimate the effect size in the presence of small study size bias. The limit meta-analysis (Fig 14, bottom panel) estimated a much smaller effect size (r = .02) that was not statistically significant.

Psychological well-being: Complete replication of meta-analysis. Table 4 reports effect sizes determined as described above for each outcome measure and each intervention comparison. These effect sizes were aggregated to yield a single effect size for each study, comparable to those reported in Bolier et al. [23]. The correlation between the effect sizes reported by Bolier et al. and the effect sizes calculated through this replication was high, r = .88 [95% CI = (0.70, 0.96)]. Fig 15 shows the forest plot of replication effect sizes. Again, the forest plot indicates that smaller studies reported larger effect sizes than larger studies. A random-effects model estimated an effect size very similar to that obtained by the reanalysis.

Depression: Reanalysis of reported data. The reanalysis used data reported by Bolier et al. [23] in their Table 2 and Fig 4. Fig 17 shows the forest plot of effect sizes. The forest plot indicates that small studies reported larger effect sizes than larger studies, and it also suggests the presence of outliers. A random-effects model estimated an effect size of r = .10 [95% CI = (0.03, 0.16)] with moderate heterogeneity as measured by I² = 51.4%.
Fig 18, top panel, shows the scatter plot of effect sizes by study size. The scatter plot indicates the presence of small-study effects. Fig 18, bottom panel, shows the funnel plot with substantial asymmetry. The regression test of funnel plot symmetry confirmed that the plot was asymmetrical, t(12) = 2.71, p = .019. Accordingly, it is necessary to estimate the effect size in the presence of small study size bias. The limit meta-analysis (Fig 18, bottom panel) estimated a much smaller effect size (r = .02) that was no longer statistically significant.

Depression: Complete replication of meta-analysis. Table 5 reports effect sizes determined as described above for each outcome measure and each intervention comparison. These effect sizes were aggregated to yield a single effect size for each study, comparable to those reported in Bolier et al. [23]. The correlation between the effect sizes reported by Bolier et al. [23] and the effect sizes calculated through this replication was high, r = .81 [95% CI = (0.49, 0.94)]. Fig 19 shows the forest plot of effect sizes and displays no apparent small study size effects. A random-effects model estimated an effect size of r = .14 [95% CI = (0.08, 0.21)] with moderate heterogeneity as measured by I² = 23.6%. The effect size estimates were recalculated after the removal of an outlier (Seligman.2006.2). A random-effects model estimated an effect size of r = .14 [95% CI = (.09, .19)] with no heterogeneity as measured by I² = 0%. A regression test of funnel plot symmetry indicated no statistically significant asymmetry, t(11) = -.17, p = .862. The limit meta-analysis estimated an effect size of r = .15 [95% CI = (.06, .24)]. A test of small-study effects showed Q-Q'(1) = .03, p = .862, and a test of residual heterogeneity indicated Q(11) = 11.51, p = .402. The replication analyses indicated a somewhat higher effect for depression than that reported by Bolier et al. [23].

Table 6 summarizes the key findings from our reanalyses of the Sin and Lyubomirsky and Bolier et al. meta-analyses.
For comparison, it also includes the effect sizes (rs) originally reported by Sin and Lyubomirsky and Bolier et al. The table highlights that reanalyses of the data reported in the two previous meta-analyses resulted in much smaller effect sizes than those originally reported. Moreover, of the seven analyses that yielded significant findings in the previous meta-analyses, only two remained statistically significant when reanalyzed in the current study, and one more depended on a single outlier. Table 7 summarizes the key findings from our complete replications of the Sin and Lyubomirsky and Bolier et al. meta-analyses. The table highlights that our replications showed generally small effects of PPIs on well-being and depression, comparable to the effects found by our reanalyses of Sin and Lyubomirsky's and Bolier et al.'s data.
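The regression tests of funnel plot symmetry reported throughout these analyses are typically Egger-type tests. As a sketch (our own illustration; the function name and toy data are made up, not drawn from either meta-analysis), the test regresses each standardized effect on its precision and asks whether the intercept differs from zero:

```python
import math

def egger_test(effects, ses):
    """Egger regression test for funnel-plot asymmetry: regress the
    standardized effect (effect / SE) on precision (1 / SE). A nonzero
    intercept signals small-study effects. Returns (intercept, t, df);
    the p-value comes from a t distribution with df degrees of freedom."""
    y = [e / s for e, s in zip(effects, ses)]
    x = [1.0 / s for s in ses]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    # Residual variance and the standard error of the intercept (OLS)
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_int = math.sqrt(rss / (n - 2) * (1.0 / n + xbar ** 2 / sxx))
    return intercept, intercept / se_int, n - 2

# Toy data in which small studies (large SEs) report inflated effects
intercept, t_stat, df = egger_test(
    effects=[0.55, 0.48, 0.25, 0.14, 0.11],
    ses=[0.30, 0.25, 0.10, 0.06, 0.05],
)
```

A clearly positive intercept here reproduces the pattern the funnel plots above display: the less precise a study, the larger its reported effect.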

Discussion
The first meta-analysis examining the effectiveness of PPIs on well-being, by Sin and Lyubomirsky [22], reported moderate effects on improving well-being and decreasing depression. A second meta-analysis, by Bolier et al. [23], focused on randomized trials only and found much smaller effects of PPIs than the first meta-analysis. Bolier et al. attributed their smaller effects to their inclusion of only higher-quality studies. However, in addition to the differences in inclusion criteria, our detailed reading of the two meta-analyses suggested an alternative explanation for the discrepancy in the reported effect sizes. The discrepancy may be due to common methodological issues affecting many published meta-analyses, including (a) the failure to weight studies by their sample size, (b) the failure to describe the calculation of effect sizes in sufficient detail, and (c) the failure to consider and adjust for small sample size bias. Therefore, though Schueller et al. [24] correctly criticized the Bolier et al. study for its unreasonably narrow selection criteria and cautioned against drawing any conclusions from the Bolier et al. meta-analysis, there may be additional reasons that warrant caution. Accordingly, our study had two major objectives. First, we reanalyzed the reported data from the two previous meta-analyses while taking into account study sizes and small sample size bias. Second, we replicated both meta-analyses, extracting the relevant effect sizes directly from the primary studies rather than relying on the data published in the previous meta-analyses. In conducting these meta-analyses, the data were analyzed using a weighted random-effects model while taking into account small sample size bias using the methods discussed above.
Our reanalysis of the effect sizes reported by Sin and Lyubomirsky [22] revealed much smaller effect size estimates for both well-being (r = .08) and depression (r = .04) than the original authors reported (r = .29 and r = .31, respectively). There were two major reasons for the inflated estimates reported by Sin and Lyubomirsky. First, Sin and Lyubomirsky reported effect size estimates as simple unweighted averages of study-level effect sizes (i.e., they averaged rs across the studies included in their meta-analysis). This approach is inappropriate because it gives equal weight to small- and large-sized studies [26]. Second, Sin and Lyubomirsky noted that their effect sizes resulted in asymmetric funnel plots, but they used the fail-safe N to conclude that small-study effects did not significantly inflate their findings. However, the fail-safe N is no longer considered an appropriate way to assess small-study effects [26]. The present study's reanalysis confirmed that the funnel plots were asymmetric for both well-being and depression, and the random-effects limit meta-analysis estimates are much smaller (and not statistically significant for depression) due to small-study effects. The replication of the Sin and Lyubomirsky [22] meta-analyses revealed relatively high correlations between the effect sizes determined by the current study and those in the previous study for both well-being and depression. Consistent with these similar study-level effect sizes, the replication analyses yielded estimated effect sizes for well-being and depression that were very similar to those obtained by our reanalyses of the effect sizes reported by Sin and Lyubomirsky. The replication analyses produced nearly the same findings as the reanalyses even though several studies that did not report the data essential to calculate effect sizes were excluded from the replications.
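The inflation produced by unweighted averaging is easy to demonstrate. In the hypothetical sketch below (toy numbers, not data from either meta-analysis), three small studies reporting large rs swamp one large study reporting a small r unless effect sizes are weighted by inverse variance, which for Fisher-z transformed correlations is approximately n − 3:

```python
import math

def fisher_z(r):
    """Fisher r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def weighted_mean_r(rs, ns):
    """Inverse-variance weighted mean of correlations: the variance
    of a Fisher-z value is ≈ 1/(n - 3), so the weights are n - 3."""
    ws = [n - 3 for n in ns]
    z = sum(w * fisher_z(r) for w, r in zip(ws, rs)) / sum(ws)
    return math.tanh(z)  # back-transform z to r

# Three small studies with large effects, one large study with a small one
rs = [0.50, 0.45, 0.40, 0.05]
ns = [20, 25, 30, 500]

unweighted = sum(rs) / len(rs)      # 0.35 — the inflated estimate
weighted = weighted_mean_r(rs, ns)  # ≈ 0.10 — dominated by the large study
```

The gap between the two estimates (.35 versus roughly .10) mirrors the kind of discrepancy the reanalysis found, which is why weighting by study size is not optional.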
Our reanalysis of the effect sizes reported by Bolier et al. [23] revealed the same estimated effect size for subjective well-being (r = .17) as reported by Bolier et al. However, the estimated effect sizes for psychological well-being (r = .02) and depression (r = .02) were smaller (and no longer statistically significant) than those originally reported by Bolier et al. (r = .10 and r = .11, respectively). When outliers were removed, the estimated effect size was r = .01 for psychological well-being and r = .07 for depression. The latter result is partially attributable to the test of funnel plot asymmetry no longer being statistically significant, in part because of the smaller number of effect sizes included. However, the limit meta-analysis estimated the effect size for depression after the removal of outliers as r = .03. The replication of the Bolier et al. [23] meta-analyses revealed relatively high correlations between the effect sizes determined by the current study and those reported in their meta-analysis for subjective well-being, psychological well-being, and depression. Despite the removal of several original studies (due to insufficient data to calculate effect sizes), the results of the replication analyses of subjective well-being and psychological well-being were very similar to those obtained by the reanalyses. The replication of the depression effects resulted in a slightly larger estimated effect size of r = .14. However, these results need to be viewed with caution because they are based on a small number of studies. Moreover, even though the small-study effects were not statistically significant, the number of studies was small, and the scatter plots of effect sizes and study sample sizes show that large-sized studies produced substantially smaller effects than small-sized studies.
In summary, the reanalyses and replications of Sin and Lyubomirsky [22] and Bolier et al. [23] indicate that PPIs have a small effect on well-being of approximately r = .10. In contrast, the effect of PPIs on depression was nearly zero when based on the studies included in Sin and Lyubomirsky [22], and highly variable and sensitive to outliers when based on the studies included in Bolier et al. [23]. Notably, Sin and Lyubomirsky [22] included nearly twice as many studies as Bolier et al. [23] in their meta-analysis of the effects of PPIs on depression.
Our review of the two highly cited meta-analyses of PPIs resulted in a number of secondary findings and implications. First, the major reason for the larger effects reported in the previous meta-analyses was that these studies did not appropriately account for prevalent small-study effects. Small-study effects are a frequent problem in meta-analyses across many fields, and a number of methods (e.g., cumulative meta-analysis, TOP10, limit meta-analysis) have been developed to estimate effect sizes in their presence. Unfortunately, these methods were not employed in the previous meta-analyses addressed by the current study. Given the presence of small-study effects, future meta-analyses of PPIs must account for them using appropriate estimation methods.
Second, these findings are tentative because the previous meta-analyses did not include all available studies. To illustrate, Bolier et al.'s [23] inclusion criteria were restrictive because they excluded (a) all relevant studies published before the term "Positive Psychology" was coined, (b) all studies of the effects of mindfulness and meditation on well-being, and (c) all studies that did not explicitly mention "positive psychology". As pointed out by Schueller et al. [24], Bolier et al.'s inclusion criteria are too narrow and exclude numerous studies that use the same interventions and the same outcome measures. If a substantial number of relevant studies were not included, findings based on only a small sample of relevant studies may not reflect the cumulative findings across the population of previous studies. In turn, not conducting a comprehensive search for primary studies also reduces meta-analysts' ability to conduct meaningful moderator analyses [24].
Third, the failure of the previous meta-analyses to include all available studies suggests the need for a comprehensive meta-analysis of PPIs' effects on well-being, starting with a comprehensive search for relevant studies. A preliminary PsycInfo search for PPI studies using only the most obvious strategy (all studies mentioning both "positive psychology" and at least one of the terms "intervention", "therapy", or "treatment") yielded over 200 relevant studies, more than tripling the number of studies included in the previous meta-analyses.
Fourth, our review of the primary studies included in the previous meta-analyses revealed persistent limitations in their method and results sections. In general, no primary studies with pre-post designs reported the pre-post correlations for outcome measures that are necessary to calculate the most appropriate effect sizes [101]. Though the authors of a number of these primary studies were contacted by email, they did not provide these correlations. As a result, the current study relied primarily on post-intervention data only, following the approach adopted by Bolier et al. [23]. Accordingly, these findings suggest that researchers need to report all statistical information necessary to facilitate future replications and/or meta-analyses. Although numerous guidelines for reporting study results exist, such as JARS [124], researchers appear slow to adopt them, and the present findings suggest the need to push for their adoption in the PPI field.
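Why a missing pre-post correlation matters can be made concrete. The sketch below uses illustrative numbers (not data from any study reviewed here) and one common variant of the standardized mean change, standardized by the SD of the change score; the same means and SDs yield noticeably different effect sizes depending on the assumed pre-post correlation:

```python
import math

def change_score_d(m_pre, m_post, sd_pre, sd_post, r):
    """Standardized mean change using the SD of the difference score,
    sd_diff = sqrt(sd_pre^2 + sd_post^2 - 2*r*sd_pre*sd_post),
    which cannot be computed without the pre-post correlation r."""
    sd_diff = math.sqrt(sd_pre ** 2 + sd_post ** 2
                        - 2 * r * sd_pre * sd_post)
    return (m_post - m_pre) / sd_diff

# Same means (20 -> 23) and SDs (6), different assumed correlations:
# d grows from ≈ .42 (r = .3) to .50 (r = .5) to ≈ .65 (r = .7)
for r in (0.3, 0.5, 0.7):
    print(r, round(change_score_d(20, 23, 6, 6, r), 2))
```

Because the correlation term sits inside the denominator, withholding it forces a meta-analyst either to guess r or, as done here following Bolier et al., to fall back on post-intervention data alone.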
Fifth, it is evident from the diverse inclusion and exclusion criteria of previous meta-analyses that there is no consensus as to what constitutes a PPI. Bolier et al. [23] excluded interventions that others consider PPIs (e.g., mindfulness and meditation). Bolier et al. even speculated that different inclusion criteria and differences in study designs were the reason for discrepancies between their findings and those of Sin and Lyubomirsky [22]. However, the current reanalysis casts doubt on this explanation as the findings were comparable when small-study effects were taken into account. The definition of a PPI is critical for determining which studies to include in future meta-analyses. Schueller et al. [24] argued that including only studies that mention "positive psychology" would miss many 'positive intervention' studies. Similarly, Parks and Biswas-Diener [125] acknowledged that it can be arduous to define interventions that are aimed at increasing the 'positives'. Clearly, this needs to be addressed in the near future.
Thus, the "true" effects of PPIs may be substantially different from what the Sin and Lyubomirsky and Bolier et al. meta-analyses indicate. While our reanalyses and replications of these meta-analyses converge in indicating that the effects of PPIs are relatively small when small sample size bias is taken into account, the effect size estimates are not definitive because neither the Sin and Lyubomirsky nor the Bolier et al. meta-analysis was comprehensive, and a large number of relevant studies are likely missing.
Accordingly, a comprehensive and transparent meta-analysis of all relevant studies of PPIs is necessary and is likely to have a major influence on the field. Such a meta-analysis is likely to allow for meaningful moderator analyses in answering questions such as: Is group administration more effective than individual administration? Are longer interventions more effective than shorter interventions? Are some types of interventions more effective than other types of interventions? Importantly, a comprehensive meta-analysis is likely to provide a more definitive determination of how effective PPIs are at increasing well-being.
Given that our meta-analyses indicate that the effects of PPIs on well-being and depression may be smaller than previously reported, future research may need to employ strategies likely to increase the effectiveness of PPIs. For example, PPIs are likely to be more effective if they are deployed over longer periods of time [126]. Some researchers have criticized the use of single short-duration PPIs in some areas [127], and others have argued that PPIs ought to be deployed over longer periods of time [128], as was done in only some of the PPI studies [129]. Moreover, the effectiveness of PPIs may depend not only on overall duration but also on the frequency of PPIs. Finally, a combination of two or three PPIs (e.g., the combination of best possible self and gratitude letters) may be more effective than a single type of PPI of equal duration [130].

Conclusions
The current study reanalyzed the data reported in previous meta-analyses that examined the effectiveness of PPIs at increasing well-being and decreasing depression, and also completely replicated those meta-analyses by extracting data from the original sources. The reanalysis of the previously reported data showed that although the correlations between the recalculated effect sizes and the effect sizes from the previous meta-analyses were fairly high (suggesting that the same data were extracted), the effect sizes were lower than previously reported and often nonsignificant. The major factor contributing to this discrepancy was that the present study accounted for the strong presence of small sample size bias. Critically, both reviewed meta-analyses omitted a large number of relevant studies, and thus effect sizes estimated from their samples of primary studies need to be confirmed by future, more comprehensive meta-analyses. Accordingly, a comprehensive and transparent meta-analysis of all relevant studies of PPIs is necessary. Such a meta-analysis will allow for meaningful moderator analyses to determine the effects of various PPIs, including whether individual PPIs are more effective than group PPIs and whether longer and more intense PPIs are more effective than shorter and less intense interventions. Our research underscores that any future meta-analysis of PPI effectiveness ought to take into account frequent methodological issues such as prevalent small sample size bias.