Responsible product design to mitigate excessive gambling: A scoping review and z-curve analysis of replicability

Objectives Systematic mapping of evaluations of tools and interventions that are intended to mitigate risks for gambling harm. Design Scoping Review and z-curve analysis (which estimates the average replicability of a body of literature). Search strategy We searched 7 databases. We also examined reference lists of included studies, as well as papers that cited included studies. Included studies described a quantitative empirical assessment of a game-based (i.e., intrinsic to a specific gambling product) structural feature, user-directed tool, or regulatory initiative to promote responsible gambling. At least two research assistants independently performed screening and extracted study characteristics (e.g., study design and sample size). One author extracted statistics for the z-curve analysis. Results 86 studies met inclusion criteria. No tools or interventions had unambiguous evidence of efficacy, but some show promise, such as within-session breaks in play. Pre-registration of research hypotheses, methods, and analytic plans was absent until 2019, reflecting a recent embracement of open science practices. Published studies also inconsistently reported effect sizes and power analyses. The results of z-curve provide some evidence of publication bias, and suggest that the replicability of the responsible product design literature is uncertain but could be low. Conclusion Greater transparency and precision are paramount to improving the evidence base for responsible product design to mitigate gambling-related harm.


Introduction
Interventions and tools for the safe use of inherently risky consumer products take many forms. For example, cars have mandatory structural features that autonomously mitigate the effects of accidents (e.g., airbags and crumple zones), include optional user-directed tools that assist with safe driving practices (e.g., turn signals and seat belts), and are subject to regulations that promote safe driving at large (e.g., minimum age of operator requirements and speed limits). Understanding the strength of evidence for various safety features and interventions can help stakeholders decide which to implement. There is concern that popular gambling products, especially electronic gaming machines and internet gambling platforms, include features that increase risky gambling behavior [1,2]. For instance, the ability to prematurely stop the reels of a video slot machine that has predetermined outcomes might give users an illusion of control over the outcome, motivating them to play longer [3]. Researchers have called for a greater emphasis on implementing safety features and interventions for gambling products [4]. However, it would be premature to make implementation recommendations without first determining whether existing evidence is based on sound research practices. Here, we report findings from a scoping review that quantitatively summarizes key features of existing research on game-based responsible gambling tools and interventions [5]. We identify trends in how studies are conducted, the state of knowledge about each type of tool, and whether a formal meta-analysis would be valuable. We also use the main result from each study to estimate the replicability of research on product safety in gambling.
that have not been thoroughly vetted might have harmful effects that outweigh any intended benefits [10]. Furthermore, the implementation of tools that appear to have promise but are in fact ineffective could provide industry actors with unwarranted moral cover from advocates of more invasive interventions [11].

Existing reviews of responsible product design for gambling
There have been several qualitative reviews of the empirical evidence for responsible gambling interventions as of 2015 (for an umbrella review, see [12]). These include reviews of structural features in electronic gambling [13], user-directed tools, such as self-exclusion [14], government and industry initiatives [15], product safety tools within real gambling environments [16], and electronic gaming machine (EGM) warning messages [17]. Although the reviews vary in their takeaway messages, they all stress that existing studies are limited by (a) relying on retrospective self-report, (b) using observational methods without incorporating features that address threats to causal inference, or (c) studying non-gamblers in laboratory settings. A more basic desideratum is whether published studies have yielded replicable findings. In addition to systematically charting the characteristics of available product safety evaluations, our review makes a unique contribution by focusing on replicability. Because opacity in how studies were conducted and analyzed undermines replicability, we also quantify the transparency of the responsible product design literature.

Replicability of responsible product design research
Collaborative efforts to estimate the replicability of studies published in eminent journals [18,19], as well as the increasing number of individual replication attempts [20], have undermined confidence in numerous foundational findings in the social sciences. Central reasons for poor replicability include low statistical power [21], undisclosed methodological decisions that artificially inflate type I error rates (often called "questionable research practices" or "researcher degrees of freedom"), and publication bias abetted by incentives for novel, positive findings [22].
Researchers can use z-curve to estimate a literature's replicability [23]. Z-curve estimates the mean power of a set of studies with significant effects. Because we do not know which studies test true alternative hypotheses, "power" here does not refer to the probability of obtaining a significant result conditional on the null hypothesis being false. Instead, power is the unconditional probability of obtaining a significant result, or "the percentage of significant results if the original studies were replicated exactly" (p. 13). To our knowledge, researchers have not yet applied z-curve analyses to any segment of the gambling literature. Brunner and Schimmack [23] find that z-curve outperforms p-curve, p-uniform, and maximum likelihood estimation in estimating the mean power of a set of studies selected for significance when there is heterogeneity in effect sizes (pp. 12-13). We expect heterogeneity in effect sizes because different researchers are studying the effects of different types of interventions.
Transparency of responsible product design research. Because many replicability issues are due to a lack of transparency about analytic decisions, we also coded the transparency of each study along several dimensions. First, we employ a broad conceptualization of potential conflicts of interest by coding not only for funding sources, but also for the presence of a conflict of interest statement and a disclosure of funding sources from the past five years. Second, we code for whether the study was pre-registered and, if so, whether a link to the pre-registration is available in the manuscript. Pre-registration is the process by which researchers preempt questionable research practices by publicly documenting their hypotheses, methods, and analytic strategies prior to commencing a study [24]. Third, we code for whether the study contains a power analysis and, if so, whether it was computed a priori or post hoc. Conducting and reporting an a priori power analysis incentivizes researchers to conduct adequately powered studies, which in turn increases the likelihood that significant effects reflect true effects [25]. Fourth, we code for whether the study accompanies the test statistics with effect sizes and, if so, whether such effects were unstandardized or standardized. Effect sizes help readers understand whether authors' qualitative description of an effect's practical importance is consistent with its actual magnitude [26]. Effect sizes also can contain information about (a) replicability, because tests of larger effects more often yield significant results, and (b) fidelity of the research process, because honestly reported tests of non-trivial hypotheses typically yield medium-to-small effect sizes [27,28].

Materials and methods
We drafted our research protocol using the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) [5]. The pre-registration for this project, as well as transparent changes to the pre-registration, are available on the Open Science Framework (osf.io/m3nju/files/).

Study inclusion criteria
We included studies if they (a) were peer-reviewed, (b) were published at any point prior to our search, (c) were written in English, and (d) described a quantitative empirical assessment of a game-based (i.e., intrinsic to a specific gambling product, rather than restrictions or tools that are intended to reduce gambling harm across gambling products) structural feature, user-directed tool, or regulatory initiative to promote responsible gambling. We applied the first three criteria during our sample acquisition and the fourth criterion during our title and abstract inspection and full-text inspection.

Information sources and search strategy
A PRISMA diagram (see Fig 1) displays a summary of our search for studies that meet the inclusion criteria. To identify potentially relevant studies, on February 5, 2020, we searched the following bibliographic databases covering a variety of scientific disciplines: Medline, Embase (medicine); PsycARTICLES, PsycINFO (psychology); Global Health (public health); the Education Resources Information Center [ERIC] (education); and the Social Science Premium Collection.
We searched abstracts for the following keywords: gambl*, betting, wager*, responsib*, regulat*, protect*, warn*, structural, and product safety. We used the following search combinations: (gambl* OR betting OR wager*) AND (responsib* OR regulat* OR protect* OR warn* OR structural OR "product safety"). Once we specified our initial sample of studies by employing the first three inclusion criteria during a database search, we eliminated duplicates resulting from databases containing overlap in their results. Then, three research assistants screened the titles and abstracts of 10% of all non-duplicates to assess whether they described an empirical test of a game-based intervention (Krippendorff's alpha = .77). When it was not clear from the title and abstract alone whether a paper met inclusion criteria, we retained the full text for inspection. Afterwards, research assistants resolved disagreements through discussion with the first author.
Next, research assistants divided the remaining retrieved studies into three groups and screened their titles and abstracts independently. After reading the full texts, the first author judged as irrelevant 11 studies that research assistants had flagged as meeting inclusion criteria. Our analytic sample consisted of all of the eligible studies from the database search (N = 43), as well as 23 studies that the first author found between February 2020 and May 2020 after examining the reference lists of previous reviews, the studies that met inclusion criteria, and studies that cited the included studies according to Google Scholar. As a final quality check, the first author examined the abstracts that the research assistants indicated were irrelevant and found 12 that in fact met inclusion criteria. The final sample consisted of 78 journal articles with 86 relevant studies. See S1 Table.

Data charting process
Raters charted the studies on the data items listed in Table 1 using Google Forms. We used an iterative process to determine the reliability of our charting. Two raters independently charted the data from a randomly selected subset of articles representing 10% of our eligible studies from the database search. Because the two coders' interrater reliability did not meet our standard (i.e., Cohen's κ of at least 0.70 and percentage agreement of at least 80% [29]) after four iterations, we amended our pre-registration such that both raters charted all studies. The basis on which the first author resolved each discrepancy is available at https://osf.io/fa539/.
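To make the reliability standard concrete, the following Python sketch computes percentage agreement and Cohen's κ for two raters and checks both against the thresholds stated above. The ratings and the function name are hypothetical illustrations; this is not the review's charting data or the authors' own syntax.

```python
from collections import Counter


def agreement_stats(rater_a, rater_b):
    """Return (proportion agreement, Cohen's kappa) for two equal-length rating lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items that both raters coded identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over codes.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / n**2
    if expected == 1.0:
        return observed, 1.0  # degenerate case: both raters used a single code
    return observed, (observed - expected) / (1 - expected)


# Hypothetical study-design codes from two raters (not data from the review).
rater_1 = ["exp", "exp", "obs", "exp", "obs", "obs", "exp", "exp", "obs", "exp"]
rater_2 = ["exp", "exp", "obs", "obs", "obs", "obs", "exp", "exp", "exp", "exp"]

pct_agreement, kappa = agreement_stats(rater_1, rater_2)
# Both thresholds must hold: kappa >= 0.70 and agreement >= 80%.
meets_standard = kappa >= 0.70 and pct_agreement >= 0.80
```

In this made-up example the raters agree on 80% of items, yet κ falls below 0.70 once chance agreement is removed, which shows why the two thresholds can disagree.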

Data charting process for study funder
We used funding sources' websites to code whether studies had ties to the gambling industry [30]. We counted a study as having direct gambling industry funding if any of the listed funding sources were directly part of the gambling industry (e.g., private companies such as Aristocrat Leisure Industries, and nationalized companies such as Loto-Q) or were nonprofits funded by the gambling industry (e.g., International Center for Responsible Gaming). Non-industry funders included government agencies, universities, and private foundations. The websites on which we based our categorization decisions are available at https://osf.io/6sg9d/.

Analytic strategy and synthesis of evidence
We examined separate cross-tabulations of intervention type (i.e., structural feature, user-directed tool, or regulatory initiative) with study funder, study design, and registration status. We also provided a narrative summary of the major findings related to the effectiveness of identified safety characteristics, organized by intervention type. Finally, we calculated a year-by-year summary count of the number of publications by intervention type. See https://osf.io/76wm4/ for the syntax to conduct these analyses.

Z-curve
We used the z-curve package in R to conduct z-curve analyses [31]. Z-curve is based on the idea that a distribution of z-scores can be derived from the average power of an entire set of studies. That distribution is truncated at the critical z-value (typically 1.96) after selection for statistical significance. Z-curve takes as input the set of significant findings (to mimic the editorial process of publishing only positive findings) and uses this truncated distribution to estimate the most likely shape of the non-truncated distribution of the population represented by the significant studies. To account for heterogeneity in effect sizes and power, z-curve estimates the distribution of all conducted studies using a finite mixture model of seven distributions, centered on z-scores of 0, 1, 2, 3, 4, 5, and 6, respectively. An expectation maximization algorithm is used to assign studies probabilities of belonging to each distribution [32]. The resulting estimate of the non-truncated z-score distribution enables the computation of several statistics. First, the area under the curve to the right of the significance criterion is the Estimated Discovery Rate, or the estimated proportion of all studies that have been conducted that had significant results. The Observed Discovery Rate represents the proportion of coded tests that had significant results in the hypothesized direction. Because our dataset represents the entire population of interest, we omit confidence intervals from our reports of the Observed Discovery Rate. Evidence for publication bias exists if the Observed Discovery Rate is higher than the upper confidence limit of the Estimated Discovery Rate.
The Estimated Discovery Rate can be used to estimate how many non-significant results there might be for each significant result. This "file-drawer ratio" is equal to the estimated proportion of non-significant results (1-Estimated Discovery Rate) divided by the Estimated Discovery Rate. The file-drawer ratio can in turn be used to compute the False Discovery Risk, or the maximum proportion of significant studies that could represent false positives. The False Discovery Risk equals the product of the file-drawer ratio and the ratio of alpha (viz., .05) to 1-alpha.
Finally, the Expected Replication Rate is the mean power of the non-truncated distribution, and represents the estimated proportion of significant studies that would yield another significant effect if subjected to a direct replication. However, commentators frequently point to differences between original studies and replication studies that could explain why the former yield larger effect sizes than the latter [33,34]. Consequently, the Estimated Discovery Rate should more accurately predict the outcome of actual replication efforts than the Expected Replication Rate.
We included all 78 studies that contained inferential tests of their key hypothesis in the z-curve analysis. Two articles used the same dataset; we included only the first publication in z-curve, as the second article examined moderators of the findings reported in the first article.
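The arithmetic linking these statistics can be sketched in a few lines of Python. The function names and the input values below are hypothetical and serve only to illustrate the formulas as stated in the text; in practice the Estimated Discovery Rate and its confidence limits come from the fitted z-curve model.

```python
def file_drawer_ratio(edr):
    """Estimated non-significant results per significant result: (1 - EDR) / EDR."""
    if not 0 < edr <= 1:
        raise ValueError("the discovery rate must lie in (0, 1]")
    return (1 - edr) / edr


def false_discovery_risk(edr, alpha=0.05):
    """Maximum share of significant results that could be false positives.

    Computed as the file-drawer ratio times alpha / (1 - alpha).
    """
    return file_drawer_ratio(edr) * alpha / (1 - alpha)


def suggests_publication_bias(odr, edr_upper_limit):
    """True when the Observed Discovery Rate exceeds the EDR's upper confidence limit."""
    return odr > edr_upper_limit


# Hypothetical inputs, not estimates from this review.
edr = 0.25
ratio = file_drawer_ratio(edr)      # three non-significant results per significant one
risk = false_discovery_risk(edr)    # 3 * (.05 / .95)
biased = suggests_publication_bias(odr=0.80, edr_upper_limit=0.60)
```

With a discovery rate of .25, for example, the sketch implies three unpublished non-significant results for every significant one and a False Discovery Risk of roughly 16%.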
If studies did not report exact p-values for significant effects, we computed them based on the sample size and either the descriptive statistics or test statistic using either base R functions (e.g., the pf function for the F distribution) or the compute.es package in R [35]. We contacted authors for this information when they did not include the minimally sufficient information in the paper. Two studies reported at least one p-value as less than .001, and we were unable to manually compute an exact value. After failing to receive clarification from the original authors, we treated these p-values as .0009 in the analysis.
We assigned studies with non-significant results a p-value of .300 if they did not report an exact value and we could not reconstruct the exact p-value from the results reported in the paper (n = 8). The value of non-significant results does not impact the outcome of z-curve. We also computed exact p-values for studies that reported p-values to fewer than three decimal places. We recorded the exact p-values of significant effects in the opposite direction of what was hypothesized (n = 3) but treated them as non-significant (arbitrarily assigning them p = .300 for the purposes of the z-curve analysis). In a sensitivity analysis, we excluded studies that used a significance criterion other than a two-tailed alpha of .05 to evaluate the p-value of interest (n = 8), as z-curve's model of publication bias is based on censoring z-scores smaller than 1.96.
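To show why the placeholder values behave as described, the sketch below converts two-tailed p-values to absolute z-scores with the standard normal quantile function. This is an illustrative Python analogue under that assumption, not the R syntax used for the analysis, and the function and variable names are ours.

```python
from statistics import NormalDist


def p_to_z(p_two_tailed):
    """Absolute z-score implied by a two-tailed p-value under the standard normal."""
    if not 0 < p_two_tailed < 1:
        raise ValueError("p-value must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1 - p_two_tailed / 2)


# Significance criterion for a two-tailed alpha of .05 (about 1.96).
critical_z = p_to_z(0.05)

# Placeholder for results reported only as p < .001.
z_reported_below_001 = p_to_z(0.0009)

# Placeholder for non-significant results without a recoverable exact p-value.
z_nonsignificant = p_to_z(0.300)

# Any z-score below the criterion is censored, so the exact placeholder
# chosen for non-significant results does not affect the fitted model.
is_censored = z_nonsignificant < critical_z
```

The .0009 placeholder maps to a z-score well above the criterion, whereas .300 maps to a z-score near 1, which falls in the censored region regardless of the exact value assigned.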
The zcurve function assumes that it is possible to identify a single key hypothesis test in each study. We anticipated that many studies in our review would regard multiple hypotheses as of equal importance or report multiple tests of the same hypothesis (e.g., using slightly different measures to represent the same dependent variable). We used the following strategies to select the "most focal hypothesis test": (a) In cases where authors tested the effect of interventions of varying dosage, we treated the test of the strongest intervention vs. the control condition as the most focal hypothesis test (e.g., if the effect of a short break and the effect of a long break are each compared to a no-break control condition, we would regard the comparison between taking a long break and taking no break as most focal); (b) When there were multiple dependent variables that were equally relevant to the central hypothesis, we randomly chose which test to regard as most focal; (c) When not all hypotheses were relevant to promoting responsible gambling, we only treated the hypotheses relevant to responsible gambling as candidates for the most focal hypothesis.
We also conducted a sensitivity analysis in which we repeated the analysis ten times, each time randomly selecting which test from each study to regard as focal by using a different seed number in R. After discovering that a small number of studies had a very large number of focal tests, we limited the number of potentially focal tests to six per study. To estimate upper and lower limits of replicability, we also re-ran z-curve once using the highest p-value from each study, and once more using the smallest p-value available from each study. The z-curve dataset is available at https://osf.io/acf3r/; the syntax we used to manually compute p-values and conduct z-curve is available at https://osf.io/aj6eu/.
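The selection logic of this sensitivity analysis can be sketched as follows. The Python code is a hypothetical analogue of the R procedure: the study identifiers and p-values are invented, and keeping the first six candidate tests per study is an assumption, because the text does not say how the six were chosen.

```python
import random

MAX_CANDIDATES = 6  # cap on potentially focal tests per study, as stated in the text


def select_focal_tests(candidates_by_study, seed):
    """Randomly pick one candidate p-value per study, reproducibly for a given seed."""
    rng = random.Random(seed)
    selected = {}
    for study_id in sorted(candidates_by_study):
        # Assumption: the cap keeps the first six candidates listed for the study.
        candidates = candidates_by_study[study_id][:MAX_CANDIDATES]
        selected[study_id] = rng.choice(candidates)
    return selected


def extreme_p_values(candidates_by_study, pick):
    """Take the largest (pick=max) or smallest (pick=min) p-value from each study."""
    return {study: pick(p_values) for study, p_values in candidates_by_study.items()}


# Hypothetical candidate p-values (not data from the review).
candidates = {
    "study_a": [0.012, 0.049, 0.300],
    "study_b": [0.001],
    "study_c": [0.030, 0.020, 0.041, 0.008, 0.150, 0.047, 0.600, 0.002],
}

# Ten repetitions, each with a different seed.
runs = [select_focal_tests(candidates, seed) for seed in range(10)]

# Bounding analyses: one dataset of largest and one of smallest p-values.
largest = extreme_p_values(candidates, max)
smallest = extreme_p_values(candidates, min)
```

Each element of `runs` would then be passed to a separate z-curve fit, and `largest` and `smallest` would feed the two bounding analyses.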

Characteristics of included studies
See Table 2 for the main characteristics of each study. We also created a Characteristics of Included Studies table that fully summarizes each study in terms of the charted items (see https://osf.io/k9sbq/).

Study design
Of the 86 included studies, 97.7% (n = 84) included at least one statistical test of the association between the game-based tool or intervention type and a gambling outcome. We observed that 69.8% (n = 60) of all studies were experimental, 91.7% (n = 55) of which randomly assigned participants to condition (i.e., 'true' experiments). The other 5 studies were quasi-experimental in that they contained multiple conditions but did not randomly assign participants to conditions. Also, 30.2% (n = 26) of all studies were observational. Among these observational studies, 34.6% (n = 9) were cross-sectional, 46.2% (n = 12) were retrospective cohort studies, 7.7% (n = 2) were prospective cohort studies, and 11.5% (n = 3) were case series.
Among the studies that exclusively sampled gamblers (n = 61), 68.9% (n = 42) recruited gamblers from a convenience population (e.g., a nearby casino, an advertisement in a local newspaper, an online panel pre-screened for gamblers, or gamblers in an undergraduate psychology student pool). The rest (31.1%; n = 19) sampled gamblers from a gambling platform or casino loyalty program. Among the 29.4% (n = 25) of studies that did not exclusively sample gamblers, 48% (n = 12) sampled community members and 52% (n = 13) sampled university students. Across all studies, 8.1% (n = 7) used screening tools during enrollment to sample at-risk gamblers, and 11.6% (n = 10) used screening tools during enrollment to exclude at-risk gamblers from participation.

Gambling concepts
Of the 86 included studies, 94.2% (n = 81) of studies measured gambling participation. 73.3% (n = 63) of studies measured the presence or severity of gambling-related problems. Included in this count are measures of gambling-related problems specifically, as well as measures of constructs that might be symptomatic of them, such as impulsivity. One study measured recall of the content of gambling messages, which we did not count as a measure of either gambling participation or gambling-related problems.

Measurement method
Many studies relied on multiple measurement methods. 76.7% (n = 66) of studies used at least one self-report measure. 61.6% (n = 53) of studies used gambling records to measure at least one construct. Two studies used proxy reports (in these cases, by trained observers), and one used financial records.
We conducted unplanned chi-square tests to explore whether the choice to measure gambling participation or gambling-related problems might have influenced the measurement method. There were significantly more studies of gambling-related problems that used self-reports than would be expected by chance, χ²(1) = 49.10, p < .001, φc = .48. There was a non-significant tendency for studies of gambling participation to not use self-report measures, χ²(1) = 0.52, p = .469, φc = -.08. The non-significance of this pattern of results held when using a Fisher's exact test [111] to account for expected cell counts with five or fewer cases. A significantly higher proportion of gambling participation studies used gambling records, χ²(1) = 5.98, p = .014, φc = .19, and we obtained the same result using a Fisher's exact test. Finally, fewer studies of gambling-related problems used gambling records than would be expected by chance, χ²(1) = 4.70, p = .031, φc = -.16.
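For readers unfamiliar with these statistics, the Python sketch below computes a Pearson chi-square test of independence and a signed phi coefficient for a 2 x 2 table. The cell counts are hypothetical, no continuity correction is applied, and the code is an illustration, not the syntax used for the reported tests.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Return (chi-square, p-value, phi) for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column must contain at least one case")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n  # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    # Signed phi: positive when the a and d cells are over-represented.
    phi = (a * d - b * c) / sqrt(rows[0] * rows[1] * cols[0] * cols[1])
    return chi2, p_value, phi


# Hypothetical counts: rows = construct measured, columns = used self-report (yes, no).
example_table = [[30, 10], [15, 25]]
chi2, p_value, phi = chi_square_2x2(example_table)
```

For a 2 x 2 table, the chi-square statistic equals the squared phi coefficient multiplied by the total count, and the sign of phi indicates the direction of the association, as with the negative coefficients reported above.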

Follow-up period
Of the 86 included studies, 29.1% (n = 25) of studies included a "follow-up" component (i.e., at least one measurement occasion beyond the day on which the intervention was implemented). The median follow-up length among studies that had a follow-up component was 60 days (minimum = 1 day, maximum = 1 year).

Current and past funding sources
Some articles reported multiple sources of funding. Of all 78 included articles, 26.9% (n = 21) had direct funding from the gambling industry, 10.4% (n = 8) had university funding, 3.8% (n = 3) had funding from a private foundation, 39.7% (n = 31) had funding from a government agency, 10.3% (n = 5) received no funding, and 25.6% (n = 20) did not provide enough information about funding source to code. Of government-funded articles, 25.8% (n = 8) were from an agency that is funded by revenue from gambling.

Conflict of interest statement
Of the 78 articles, 20.5% (n = 16) reported conflicts of interest, 30.8% (n = 24) reported that they had no conflicts of interest, and 48.7% (n = 38) did not include a conflict of interest statement. Only two studies provided an explicit statement about all sources of funding that authors had received in the past five years.

Pre-registration status
Of the 86 included studies, 93% (n = 80) of studies made no mention of a pre-registration. 5.8% (n = 5) of studies had a pre-registration that we were able to access. An additional study mentioned a pre-registration, but the hyperlink did not work. All six studies that mentioned a pre-registration were published in either 2019 or 2020.

Power analysis
Of the 84 studies that included a statistical test, 7.1% of studies (n = 6) reported an a priori power analysis. Another 6.0% (n = 5) reported a post hoc power analysis. 87.2% (n = 75) of studies did not report a power analysis justifying sample size.

Effect size
Of the 84 studies that included a statistical test of an intervention on gambling, 64.3% (n = 54) reported at least one effect size. Of those, 37.0% (n = 20) reported at least one unstandardized effect size, 77.8% (n = 42) reported at least one standardized effect size, and 9.3% (n = 8) reported at least one unstandardized effect and at least one standardized effect size. Because of the very large number of tests reported in several studies, many of which were not related to the evaluation of responsible product design per se, we abandoned our pre-registered plan to "report what percentage of the test statistics we transcribe are accompanied by an effect size."

Unplanned coding of sampled populations
While charting studies, we noticed that most studies were conducted in a small number of countries. To follow up on these anecdotal observations, we charted the country or countries from which each study sampled. The most commonly sampled countries were Australia (22.1%; n = 19), Canada (20.9%; n = 18), and the United States (10.5%; n = 9). There were six studies (6.9%) that sampled several countries, usually all from Europe, but in some cases from multiple continents. There were eight studies (9.3%) where authors did not specify the country(ies) where the research was conducted. Though studies examining causes of excessive gambling in Asian, African, or South American countries exist [e.g., 112,113], none of the included studies sampled these populations.
We also formed the impression that most studies did not discuss threats to generalizability based on participant characteristics. Consequently, we transcribed statements in each article's discussion section about potential limitations based on the sampled population. 15.1% of studies (n = 13) explicitly mentioned limitations based on country or cultural milieu (e.g., ethnic group, socioeconomic status, etc.). Many discussion sections that did not mention cultural constraints did discuss how university students might differ from the general population, how sampling players who favor a certain game might have impacted results, or how results from low-risk gamblers might not extend to high-risk gamblers.

Game-based tool or intervention type
Some studies (n = 7) tested multiple types of responsible product designs. Of the 86 studies, 61.6% (n = 53) tested structural tools, 41.9% (n = 36) tested user-directed tools, and 4.7% (n = 4) tested regulations. We used chi-square tests of independence to examine whether there is a relationship between intervention type and study funder, study design, or registration status. We excluded studies that investigated multiple types of tools (n = 7) from this analysis.
Intervention type had a non-significant association with pre-registration status, χ²(4) = 9.20, p = .056, φc = .24. The test was significant in an exploratory follow-up using Fisher's exact test to account for low expected cell counts, p = .027. This potential effect was driven by all pre-registered studies testing user-directed tools.
Intervention type significantly varied by study design, χ²(2) = 13.73, p = .001, φc = .42. Standardized residuals were significant (greater than 2 [111]) for structural features and regulations, but not user-directed tools. Structural tools were tested more often via experiments (n = 37) than by observational methods (n = 9). User-directed tools were tested by observational methods (n = 13) almost as often as by experiments (n = 16). Regulations were tested only in observational studies.
Last, we found that intervention type did not vary by industry-funded research status, χ²(2) = 1.69, p = .429, φc = .15. The non-significance of this pattern remained when we ran an unplanned analysis that counted studies (n = 9) sponsored by government agencies that are funded by earmarked tax revenue from gambling operators (e.g., research funded by Gambling Research Exchange Ontario) as industry-funded research.

Narrative review
Structural feature tools. Our charting suggested that the gambling research literature has examined three main structural feature tools: pop-up messages, breaks in play, and covert structural tools.
Pop-up messages. In all, 41 of 53 structural feature tool studies (77%) examined pop-up messages. Perhaps the most well-studied pop-up warning message seeks to educate participants of the statistical principles that explain why the expected value of gambling is negative. However, studies of so-called self-appraisal messages that encourage gamblers to reflect on whether their current gambling behavior is consistent with their goals are also common.
There were 20 pop-up message studies (49%) suggesting that pop-up messages had a favorable responsible gambling impact. For example, Jardin and Wulfert [74,75] compared popups that remind participants about the chance-based nature of gambling to pop-ups with trivia and a control condition with no pop-ups. Participants who received reminder messages lost less money and made fewer bets than those who read trivia or did not see a pop-up.
We observed that 7 pop-up message studies (17%) suggested pop-up messages had no responsible gambling impact. For example, Lavoie and Main [80] presented pop-ups just before playing slots which warned that gambling can produce a state of immersion that can cause excessive spending. In a second study, they presented pop-ups in the middle of blackjack to inform the user how long he or she had been playing. The pop-ups did not reduce immersion or time or money spent gambling in either study.
Finally, our charting indicated that 14 pop-up message studies (34%) suggested pop-up messages had mixed impact. For example, Tabri, Hollingshead, and Wohl [102] had participants set money limits, and varied whether participants (a) received a single warning message asking whether they would like to continue when the limit was reached, or (b) also received a message when they were close to reaching their limit. Messages about approaching limits increased stopping of play before participants reached their limits. This effect was strong (i.e., an odds ratio above 31) for participants who did not have a financially focused self-concept (i.e., those who do not define themselves in terms of financial success), but was non-significant for those with a financially focused self-concept (i.e., those whose self-worth is tied up in financial success).
Breaks in play. In all, 7 of 53 structural feature tool studies (13%) examined breaks in play. The typical justification for mandatory pauses in between rounds or after playing for a certain duration is that gambling induces in at-risk players a dissociative state that undermines rational decision-making [114,115]. In games of chance, the break in play would purportedly mitigate excessive gambling by lifting players out of their trance. In games of skill, a forced pause would give losing players the time to reevaluate their strategy.
There were 3 breaks in play studies (43%) suggesting favorable responsible gambling impact. For example, people with [103] and without gambling-related problems [59] from Wales played card games in which winning became less likely over time. Imposing a 5-second pause between bets reduced the number of rounds played and the magnitudes of monetary losses.
We observed that 3 breaks in play studies (43%) had no or unfavorable responsible gambling impact. For example, one experiment [50] varied whether Australian university students received no break, a 3-minute break, or an 8-minute break from blackjack. The results indicated that both 3-minute and 8-minute breaks increased cravings to gamble and did not decrease dissociative feelings.
Finally, one breaks in play study (14%) suggested mixed responsible gambling impact [94]. The authors found that a three-minute break did increase the response latency between rounds for EGM players in the face of consistent losses. However, this slowed play did not translate into playing fewer trials.
Covert structural tools. In all, 6 of 53 structural feature tool studies (11%) examined covert structural tools. Covert interventions are intended to affect the proximate causes of excessive gambling without requiring buy-in from the gambler. One covert intervention study (17%) reported no responsible gambling impact [37]. The authors examined whether an implicit prime of analytic thinking, or a stimulus that is designed to induce a reflective mindset without the participant realizing that the stimulus has this effect, would attenuate erroneous gambling beliefs and gambling intensity in a sample of EGM players. Randomly assigning participants to unscramble sentences that either did or did not include words related to rationality had no significant effects on gambling beliefs or behavior.
Our charting indicated that 5 covert intervention studies (83%) yielded mixed responsible gambling impact. For example, Sharpe and colleagues [99] modified EGMs in Australian venues with varying levels of reel speeds, maximum bet restrictions, and restrictions on the maximum banknote accepted. Limiting the maximum bet to $1 reduced the number of bets and losses relative to EGMs with $10 bet maximums, but modifying note acceptors and reel speeds did not reduce gambling activity.
User-directed tools. Our charting suggested that the gambling research literature has examined two main user-directed feature tools: precommitment and information aids.
Precommitment. In all, 16 of 36 user-directed tool studies (44%) examined precommitment. Precommitment involves prospectively planning to restrict one's own ability to gamble excessively. Its efficacy is premised on players recognizing that they have difficulty exercising self-control "in the heat of the moment." Our charting revealed 6 precommitment studies (38%) finding that precommitment had a favorable responsible gambling impact. Brevers and colleagues [53] presented participants with a gambling task with four options that varied in reward and risk. During "precommitment" trials, there was a preliminary step in which participants could remove the high-risk, high-reward options from the trial. In the control trials, all four options were always available. Participants eliminated the high-risk options in about half of precommitment trials, resulting in lower-risk decisions in the precommitment condition.
We observed 4 precommitment studies (25%) suggesting that precommitment had no or unfavorable responsible gambling impact. In arguably the best-designed study that met inclusion criteria [11], researchers manipulated whether Finnish online gamblers were presented with a prompt to consider setting a deposit limit. Deposit limits control how much users can wager over a certain period. The prompts greatly increased limit-setting relative to a no-prompt control condition. But this increased limit-setting did not reduce net loss, total number of days gambled, or amount of money deposited in the following 90 days.
Finally, 6 precommitment studies (38%) reported that precommitment had mixed responsible gambling impact. To illustrate, Caillon and colleagues [56] randomly assigned users to self-exclude from an online gambling platform for a week. Self-exclusion prevents the user from using a certain gambling platform at all. Self-exclusion did not cause significant differences in money or time spent gambling fifteen days or two months after the self-exclusion began. There were, however, decreases in gambling illusions and desire to gamble two months later.
Information aids. In all, 24 of 36 user-directed tool studies (67%) examined information aids. Information aids provide the user with facts about a game's payout structure. They should reduce gambling to the extent that they correct misbeliefs that cause excessive play and should minimize losses to the extent that they help users make statistically optimal decisions. There were 8 information aid studies (33%) suggesting that information aids had a favorable responsible gambling impact. For example, when scratch card players had access to both the number of unclaimed prizes left and the payback percentage (the ratio of unclaimed prizes to total scratch cards remaining) in a numerical format, participants were influenced by the former even though it is only the latter that is diagnostic [104]. A follow-up experiment presented the same payback percentage information using a visual (a five-star rating system). Participants now more often relied on the payback percentage to make decisions about scratch cards, even though they did not understand the concept of payback percentage any better than participants in the first experiment. The visual rating system might have been interpreted by participants as recommendations for or against a given scratch card. So long as participants trusted the testimony of the experimenters, there was no need for participants to understand why payback percentages are more germane than counts of unclaimed prizes.
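To illustrate why the count of unclaimed prizes alone is not diagnostic, the sketch below compares two hypothetical scratch card games using the ratio described above (unclaimed prizes divided by cards remaining). The game names and numbers are invented for illustration and do not come from the cited experiments.

```python
def prize_ratio(unclaimed_prizes, cards_remaining):
    """Ratio of unclaimed prizes to scratch cards remaining."""
    return unclaimed_prizes / cards_remaining

# Hypothetical games: (unclaimed prizes, cards remaining)
games = {"Game A": (100, 10_000), "Game B": (50, 1_000)}

# Choosing by raw prize count and choosing by ratio can disagree
most_prizes = max(games, key=lambda g: games[g][0])
best_ratio = max(games, key=lambda g: prize_ratio(*games[g]))

for name, (prizes, cards) in games.items():
    print(f"{name}: {prizes} prizes left, ratio = {prize_ratio(prizes, cards):.3f}")
print("Most unclaimed prizes:", most_prizes)
print("Best prize ratio:", best_ratio)
```

In this example the game with more unclaimed prizes has the worse ratio, so a player who attends only to the prize count would pick the less favorable card.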
We observed that 6 information aid studies (25%) suggested information aids had no or unfavorable responsible gambling impact. For example, Beresford and Blaszczynski [49] tested multiple formats to improve understanding of return-to-player percentage. The concept refers to the percentage of money that a game returns to players in the long run, but players often believe that it approximates the percentage of stakes that remain with the average player at the end of individual sessions. This belief is incorrect because EGM winnings are designed to vary substantially in the short-term, and reinvesting wins tends to reduce earnings to zero. The authors reported that neither an infographic, a vignette, nor a brochure increased understanding of return-to-player percentage relative to the mandatory signage on EGMs in South Australia.
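The point about reinvested wins can be made concrete with a small worked example. The sketch below assumes a hypothetical return-to-player percentage of 90% and a player who re-wagers the entire balance on every cycle; it tracks only the expected balance and ignores the short-term variability noted above.

```python
def expected_balance(starting_balance, rtp, cycles):
    """Expected balance after wagering the whole balance `cycles` times.

    Each full wagering cycle returns, on average, `rtp` times the amount staked,
    so the expected balance shrinks geometrically.
    """
    return starting_balance * rtp ** cycles

# Hypothetical values: a $100 starting balance and a 90% return-to-player rate
for cycles in (1, 5, 10, 20, 50):
    balance = expected_balance(100, 0.90, cycles)
    print(f"after {cycles:>2} cycles: expected balance = ${balance:.2f}")
```

Under these assumptions the player expects to keep 90% of the stake only after a single pass; with continued reinvestment the expected balance approaches zero, which is why a session rarely ends with 90% of the starting money.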
Finally, our charting indicated that 10 information aid studies (42%) reported mixed responsible gambling impact. For instance, "Basic strategy" is a decision tool for reducing the house edge in blackjack to less than one percent [116]. Phillips and colleagues conducted experiments where the blackjack program recommended the decision that was in accordance with Basic strategy to the user. The authors found that the presence of recommendations increased adherence to statistically optimal play, but also increased participants' willingness to make risky bets [97].

Regulatory initiatives
Our charting suggested that the gambling research literature has empirically examined two main regulatory initiatives: restricting the supply of EGMs (n = 2) and restricting EGM features (n = 2). The supply reduction studies did not show favorable impact. Delfabbro [60] observed how restrictions in the number of EGMs allowed in venues in South Australia increased gambling revenues. A follow-up survey revealed that most gamblers in the region had noticed that the number of EGMs had fallen, but only a minority had reported gambling less as a result.
The feature restriction studies did show evidence of efficacy. Hansen and Rossow [69] found that adolescent gambling-related problems declined in Norway after the government removed banknote acceptors from slot machines. Participants reported having gambled fewer times in the past year after the ban on banknote acceptors. In a follow-up analysis [68], the authors reported that this decrease held across all levels of gambling, although the decrease in the proportion of participants gambling at least 80 times a year was about three times larger than the decrease in participants gambling at least 20 times a year.

Replicability: Z-curve
The point estimates and 95% confidence intervals for the z-curve of the most focal hypothesis tests, as well as our 12 sensitivity tests (i.e., the highest p-values, the lowest p-values, and 10 iterations in which p-values from each study were randomly selected), are presented in Table 3. Because choosing the single most focal hypothesis test was difficult in many cases, we summarize the range of point estimates across the 13 iterations rather than focusing just on the single z-curve composed of p-values that we deemed most focal.
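The sensitivity iterations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the study labels and p-values are hypothetical, the helper names are invented, and the z-curve model itself (which is fitted to the resulting z-scores) is not implemented here.

```python
import random
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-tailed p-value (0 < p < 1) to an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

def select_p_values(studies, rule, seed=0):
    """Pick one p-value per study: the largest, the smallest, or a random one."""
    rng = random.Random(seed)  # seeded so a random iteration is reproducible
    pick = {"largest": max, "smallest": min, "random": rng.choice}[rule]
    return [pick(p_values) for p_values in studies.values()]

def observed_discovery_rate(p_values, alpha=0.05):
    """Proportion of selected tests that are significant at `alpha`."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical p-values for the candidate focal tests in three studies
hypothetical_studies = {
    "study_1": [0.001, 0.04, 0.20],
    "study_2": [0.03, 0.60],
    "study_3": [0.01],
}

for rule in ("largest", "smallest", "random"):
    selected = select_p_values(hypothetical_studies, rule, seed=1)
    z_scores = [round(p_to_z(p), 2) for p in selected]
    odr = observed_discovery_rate(selected)
    print(f"{rule:>8}: z = {z_scores}, ODR = {odr:.2f}")
```

Repeating the random rule with different seeds would give the multiple random iterations, each of which yields its own set of z-scores for fitting.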
The Observed Discovery Rate ranged from .50 to .74 across our tests, indicating that most studies in the responsible product design literature report significant findings. The Expected Discovery Rate ranged from .10 to .58, suggesting that at least 42% of studies that have been conducted on responsible product design had null results. The Observed Discovery Rate was higher than the Expected Discovery Rate in all cases, by 204% on average. However, the confidence intervals of the Expected Discovery Rate were generally very large. In 8 of 13 iterations of z-curve (e.g., Most Focal and Smallest p), the upper confidence limit for the Expected Discovery Rate was higher than the point estimate of the Observed Discovery Rate, so this pattern is not statistically significant evidence of publication bias. In 4 cases (e.g., Largest p and Random 2), the point estimate for the Observed Discovery Rate was higher than the upper confidence limit of the Expected Discovery Rate. This is statistically significant evidence of publication bias. In one case (i.e., Random 4), the point estimate for the Observed Discovery Rate was well within the confidence interval of the Expected Discovery Rate and vice versa, providing some evidence against publication bias.
The Expected Replication Rate ranged from .60 to .79. The maximum false discovery rate ranged from .04 to .46. However, the confidence intervals for these estimates are so wide that they preclude our ability to make a general statement about the maximum proportion of significant findings that could be false positives. The file-drawer ratio ranged from 0.72 to 8.73. Similarly, the associated confidence intervals were too wide to allow us to draw conclusions about how many studies are conducted for each significant result that is published.
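For readers unfamiliar with these quantities, the sketch below shows how the file-drawer ratio and the maximum false discovery rate follow from the Expected Discovery Rate (EDR). It assumes the definitions commonly used with z-curve: the file-drawer ratio is (1 − EDR)/EDR, and the maximum false discovery rate is Sorić's bound, (1/EDR − 1) × α/(1 − α). The function names are illustrative.

```python
def file_drawer_ratio(edr):
    """Expected number of non-significant results per significant result."""
    return (1 - edr) / edr

def soric_max_fdr(edr, alpha=0.05):
    """Soric's upper bound on the share of significant results that are false positives."""
    return (1 / edr - 1) * alpha / (1 - alpha)

# The two ends of the EDR range reported above (rounded to two decimals)
for edr in (0.10, 0.58):
    print(
        f"EDR = {edr:.2f}: file-drawer ratio = {file_drawer_ratio(edr):.2f}, "
        f"max FDR = {soric_max_fdr(edr):.2f}"
    )
```

Plugging in the rounded EDR range of .10 to .58 gives values close to the reported ranges for both quantities; small discrepancies at the upper ends are presumably due to rounding of the EDR.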
Re-running these 13 tests without studies that used a significance criterion other than a two-tailed alpha of .05 yielded a similar pattern of results, with the Observed Discovery Rate 149% larger than the Expected Discovery Rate on average. However, the Observed Discovery Rate was significantly higher than the Expected Discovery Rate in only one case (see S2 Table).

Discussion
We conducted a scoping review of 86 studies evaluating game-based responsible gambling tools that were published between 2001 and 2020 to better understand the current state of the literature. Several general trends in study design were apparent in charting the included studies. First, studies were most likely to involve structural tools, followed by user-directed tools, and then game-specific regulations. (Of course, in practice some jurisdictions allow players to opt in or opt out of tools that we categorized as structural because the authors conceptualized and implemented them as involuntary. We only categorized tools as user-directed if participants could choose whether to use them or the authors noted that they would be implemented as user-directed.) Researchers were more likely to test structural features by experiments, about equally likely to use experiments or observational methods to test user-directed tools, and used only observational methods to study regulations.
The median sample size was 136. This is higher than in social and personality psychology (median = 104) [117] but lower than in clinical psychology (median = 179) [118] during comparable periods. Most studies did sample actual gamblers. Those that did not sampled from the community as often as they sampled from university participant pools, the latter of which is unrepresentative of individuals who are at risk of gambling harm [119]. Most studies did not include follow-up periods, but some of those that did measured gambling months later.
Most studies included self-reports, usually of gambling-related problems. However, most studies (61.6%) also included behavioral measures that were captured by gambling records. Given the importance of observing behavior to uncover people's preferences, responsible product design research is faring much better than social psychology, where only about 6%-12% of the empirical papers in what many regard as the field's premier journal (Journal of Personality and Social Psychology) feature behavioral measures [120].
It was rare in the responsible product design literature or in narrative reviews thereof to find a discussion of cultural moderators of a given intervention's efficacy. Most interventions so far have been tested in a small number of countries, such as Canada and Australia. There were no studies which primarily sampled populations in Asia, Africa, or South America. These results are consistent with general trends in social science more broadly, and pose similar risks for overgeneralization [121]. Ideally, researchers would articulate the specific characteristics of their sample that theory would predict to constrain the generality of results [122].

Preliminary conclusions about efficacy
Convergence in results across studies licenses some preliminary recommendations about which kinds of responsible product design are promising. Unfortunately, our review of the literature is consistent with earlier reviews: none of the game-based intervention tool types provide strong evidence for a particular strategy. Setting that reality aside, for structural feature tools, the best evidence supports pop-ups that encourage self-appraisal rather than pop-ups that attempt to rein in the influence of cognitive distortions. These pop-ups likely work in part because they create a brief break in play between trials. Imposing breaks long enough to effectively end a session, by contrast, increases craving and is irrelevant in settings where customers can easily switch to a different game. For covert interventions, modifications that undo features of games that promote excessive gambling likely have efficacy.
There are evidence-based reasons to doubt that user-directed tools are sufficient to prevent risky gambling. Many gamblers view pre-commitment tools as relevant only to people with gambling-related problems [123], and a common feature of experiencing such problems is a denial of excessive gambling [124]. Moreover, precommitment tools such as limit-setting and self-exclusion do not reliably reduce time or money spent gambling. On the other hand, ruling out riskier bets from the start [53] is a novel idea with some support. About a third of information aid studies indicated that they appear to have positive short-term effects, but long-term effects still require examination. Finally, with only four studies on regulatory initiatives, it is premature to draw conclusions about efficacy.

Replicability and transparency
We tested whether the responsible product design literature contains publication bias using z-curve. Point estimates for the Expected Discovery Rate were on average much lower than Observed Discovery Rate point estimates. Thus, there is probably some publication bias based on statistical significance in the responsible product design literature. Bias based on statistical significance could manifest through not publishing null results or using questionable research practices to obtain significant results. However, the confidence intervals for the Expected Discovery Rate were very wide, leaving the magnitude of publication bias in this literature unclear. Furthermore, z-curve is a relatively new method, so we counsel caution in interpreting our results as the final word on the replicability of studies of product safety in gambling.
To the extent that direct replications inevitably differ in some respects from the original studies, the point estimates of the Expected Discovery Rate suggest that most replications of significant findings in the responsible product design literature would fail. Conditions appear more favorable, however, if exact replications could be conducted, as the Expected Replication Rate ranged from .61 to .79. These estimates are in line with large-scale replication efforts in experimental philosophy [125], experimental economics [126], and social science experiments published in Science and Nature [18], but lower than research on associations between the five-factor model of personality and consequential life outcomes [127].
How do our estimates of replicability compare with those of the Open Science Collaboration [19], which attempted to replicate 100 studies published in high-impact psychology journals in 2008 and has played a large role in generating concerns about low replicability in social science? Bartoš  These observations suggest that the literature on responsible product design in gambling is less insistent on the inclusion of significant findings than publications in eminent psychology journals were in 2008, but is not more insistent on replicable findings.
An exploratory analysis of z-curve within game-based intervention type did not lead to greater clarity. In fact, the z-curve would only run for studies of pop-up messages and information aids due to there being too few significant effects for other types of interventions. For pop-up messages, the results were very similar to the overall dataset. For information aids, the extant studies had a slightly lower Observed Discovery Rate, a lower Expected Discovery Rate but a very similar confidence interval, and a slightly lower Expected Replication Rate. Hence, our appraisal of the replicability of responsible product design appears to generalize across intervention types, though this could be an artifact of pop-up messages making up the majority of interventions studied.
Effect size magnitude also can speak to replicability, as studies of large effects (all else equal) have more statistical power, but very large observed effects can be symptomatic of infidelities in the research process. In the responsible product design literature, researchers often report only standardized effect sizes. We recommend that researchers prioritize unstandardized effect sizes because they typically frame research questions in unstandardized terms [26], such as whether a certain tool will reduce money or time spent gambling. Even dependent variables that use rating scales, such as screeners for gambling-related problems, can be imbued with meaning because they often have validated thresholds based on harm severity [128]. Standardized effect sizes are appealing in large part because they put effects composed of different variables on the same metric. Nevertheless, standardized effect sizes do not directly illuminate the relative importance of different predictors because they are influenced by the variance of the predictors and the outcome [26,129].
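A brief worked example may help show why a standardized effect size depends on outcome variance. The sketch below uses hypothetical numbers: the same $10 reduction in money spent is converted to a standardized mean difference (Cohen's d) in two samples that differ only in how variable spending is.

```python
def cohens_d(mean_difference, pooled_sd):
    """Standardized mean difference: the raw difference divided by the pooled SD."""
    return mean_difference / pooled_sd

raw_reduction = 10.0  # hypothetical: the tool reduces spending by $10 per session

# The same raw effect in a low-variance and a high-variance sample (hypothetical SDs)
for label, sd in (("homogeneous sample", 20.0), ("heterogeneous sample", 100.0)):
    print(f"{label}: raw = ${raw_reduction:.0f}, d = {cohens_d(raw_reduction, sd):.2f}")
```

Here an identical dollar reduction would be labeled a medium effect in one sample and a small effect in the other, even though the practical consequence for a gambler is the same.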
About one third of studies did not report any effect sizes. Journals could incentivize effect size reporting by insisting that researchers conduct power analyses. Few studies incorporated a power analysis, and about half of those that did reported a post-hoc power analysis, which is redundant with the p-value [129]. All included studies that did conduct an a priori power analysis used conventional benchmarks for what constitutes a medium or large effect [130]. Basing sample size on effect size conventions is dubious because it ignores the influence of measurement error [73]. Furthermore, small standardized effects may have large practical effects in the long term or when scaled to a large population [27]. That said, we concur that power analysis should be based on the minimum effect size necessary to justify implementation. The smallest effect size of interest could be defined by the smallest change in the dependent variable that causes gamblers to report that they are experiencing less harm [131].
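As a rough illustration of an a priori power analysis anchored on a smallest effect size of interest, the sketch below computes the sample size per group for a two-group comparison of means. It assumes a two-tailed test, equal group sizes, and the normal approximation (which slightly underestimates the sample size an exact t-test calculation would give). The dollar values and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(smallest_raw_effect, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a raw mean difference between two groups."""
    d = smallest_raw_effect / sd  # convert the raw effect to standardized units
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-tailed
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Hypothetical: the smallest reduction worth implementing is $10 per session
for sd in (20.0, 50.0):
    print(f"SD = ${sd:.0f}: n per group = {n_per_group(10.0, sd)}")
```

The required sample grows quickly as the smallest effect of interest shrinks relative to outcome variability, which is one reason to define that effect in substantive terms rather than by convention.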
A lack of transparency undermines the benefit of adopting practices that increase replicability. Pre-registration limits obfuscation of which analyses were planned versus exploratory [24]. Although extant evidence does not support the contention that industry influences the methodology of responsible gambling research [132], pre-registration would be a worthwhile additional step to ensure independence between researchers and industry actors who fund them [133]. We found that only a handful of studies in the responsible product design literature were pre-registered, and all of them were published in 2019 or 2020. These findings are consistent with trends in psychological science, in which uptake of pre-registration between 2014 and 2017 was rare [134], perhaps because the studies on which such publications were based were conducted before knowledge of pre-registration was widespread.
Similarly, researchers must become more routine about disclosing their potential conflicts of interest and funding sources. About half of the included articles did not include a conflict of interest statement. Of course, journals have not always provided space to report conflicts of interest, or required a conflict of interest statement when they provide the space. Journals must do their part in requiring authors to report all funding sources and potential conflicts of interest.
More generally, the absence of widespread transparent practices in the published gambling literature is not completely surprising, and we do not draw attention to this issue to single anyone out. Contemporary dialogue related to open science principles and practices became widespread in related academic sectors around 2012 (e.g., psychology [135,136]); however, editorials addressing these topics to gambling researchers are more recent [137]. Supporting these editorials' calls for greater attention to open science with empirical evidence and transparent research practices should help advance meaningful change in gambling studies.

Study limitations
The primary limitation of the present scoping review is its scope. Official reports and internal government studies on product safety in gambling were not included. We also did not include unpublished studies, some of which may have been high quality. This method is grounded in the conservative approach of keeping the review limited to research that has undergone the formal scrutiny of peer review. This decision was also consistent with the use of z-curve, which estimates the mean power of a published literature after selection on statistical significance. It is possible that including pre-prints, gray literature, and abandoned works would have allowed for a more direct assessment of the hypothesis that published results are more likely to report significant results than unpublished work. On the other hand, the difficulty of tracking down all relevant unpublished studies would risk underestimating the extent of publication bias in the peer-reviewed literature based on statistical significance.
A second limitation is that the keywords and inclusion criteria may have led us to exclude or overlook studies that would have changed our conclusions about efficacy or replicability. These potential omissions might have affected the conclusions we drew about replicability and publication bias based on the results from z-curve. Third, a relatively small number of research groups have authored many of the included studies. The practices of those researchers might have a disproportionate influence on our inferences about how evaluations of game-based tools and regulations are designed, analyzed, and reported. Last, we could not include all eligible studies in z-curve because they did not include an inferential test of their most focal hypothesis.

Conclusion
The responsible product design literature has several reassuring trends, such as widespread use of experimental methods and behavioral measures. But uncertainty about the literature's overall methodological rigor, replicability, and transparency precludes any strong recommendations about which interventions stakeholders should promote and implement. Setting these important factors aside, the product safety literature currently provides the best, albeit limited, evidence for pop-ups with self-appraisal messaging, breaks in between rounds of play, precommitment to less risky bets, undoing EGM features that promote excessive gambling, providing recommendations that minimize house edge, and removing banknote acceptors. Because the literature remains premature, we do not think that a meta-analysis on the responsible product design literature would settle this matter. Until there are a much larger number of high-powered, transparently reported studies, confident evidence-based product safety recommendations remain elusive.