The Weapons Identification Task: Recommendations for adequately powered research

This article synthesizes the extant literature on the Weapons Identification Task (WIT), a sequential priming paradigm developed to investigate the impact of racial priming on the identification of stereotype-congruent and stereotype-irrelevant objects. Given recent controversy over the replicability of priming effects and the statistical power required to detect them, the aim of this synthesis is to systematically assess the literature in order to develop recommendations for statistical power in future research with the WIT paradigm. To develop these recommendations, the present article first quantitatively ascertains the magnitude of publication bias in the extant literature. Next, expected effect sizes and power recommendations are generated from the extant literature. Finally, a close conceptual replication of the WIT paradigm is conducted to prospectively test these recommendations. Racial priming effects are detected in this prospective test, lending increased confidence in the WIT priming effect and credibility to the proposed recommendations for power.


Introduction
Adequately powered research is important for many aspects of a cumulative science. Relative to underpowered designs, adequately powered designs yield a) greater opportunity to observe true effects (if they exist), b) lower rates of false positives (Type I errors) in the published literature, c) more precise estimates of an effect's magnitude, and d) greater interpretability of null findings [1][2][3].
With these considerations in mind, it will be productive to establish shared power guidelines for paradigms that are commonly used in the literature [4]. The purpose of this brief review is to generate and prospectively test power recommendations specific to the Weapons Identification Task (WIT), a commonly used sequential priming paradigm developed to investigate the influence of stereotypes on the identification of stereotype-congruent and stereotype-irrelevant objects [5]. The theory and rationale for the WIT is similar to other racial priming tasks that involve weapon identification, namely the First-Person Shooter Task [6] and the shooter computer simulation task [7]. Although they share supporting theory and rationale, these measures have low correspondence to each other, which may indicate that behavioral performance on these tasks is driven by different mixtures of cognitive processes [8].
The present review has the following aims:

Weapons Identification Task
The Weapons Identification Task (WIT) is a variant of sequential priming procedures adapted from cognitive psychology. Participants completing the WIT view a series of trials that consist of one of two prime faces that differ by race (Black faces or White faces) and one of two target images that differ by object type (guns or tools). In a standard implementation of the procedure, a fixation cross precedes the presentation of a prime image for 200-ms. The prime image is directly replaced by a target image for 200-ms with no inter-stimulus interval. Finally, the target image is backward masked until a response is given. Participants render dichotomous responses to indicate having seen either a gun or a tool. Thus, the WIT is a 2 (prime type: Black face versus White face) X 2 (target type: gun versus tool) within-subjects design. From here the paradigm branches to investigate the effect of racial priming on either judgment reaction times or errors in judgment (for a review see [9]).

Reaction time paradigm and effects. In the reaction time or 'RT' paradigm, participants respond as quickly as possible to identify target objects. Importantly, participant judgments in the RT variant are not constrained by a response deadline (see [5] Exp. 1). Effects reported in the literature take the form of an attenuated interaction between the two within-subjects factors (prime and target) on reaction times. Participants correctly identify guns more quickly following Black versus White primes. In contrast, participants correctly identify tools either as quickly following both primes or, in some cases, more quickly following White versus Black primes (a crossover interaction).
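The 2 (prime) x 2 (target) within-subjects structure can be sketched as a simple trial-list generator. The stimulus labels, repetition count, and random seed below are illustrative only and are not taken from any published implementation:

```python
import random

# Illustrative WIT trial-list construction. The 2 (prime) x 2 (target)
# crossing follows the paradigm; the per-cell repetition count here
# is hypothetical, not the count used in any particular study.
PRIMES = ["black_face", "white_face"]
TARGETS = ["gun", "tool"]

def build_trials(reps_per_cell=18, seed=1):
    """Fully cross primes with targets, repeat each cell, shuffle run order."""
    trials = [(p, t) for p in PRIMES for t in TARGETS] * reps_per_cell
    random.Random(seed).shuffle(trials)
    return trials

trials = build_trials()  # 4 cells x 18 repetitions = 72 balanced trials
```

Because every cell appears equally often, prime race and target type are uncorrelated across the run, which is what licenses interpreting the prime x target interaction.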
Errors paradigm and effects. In the 'Error' paradigm, participants again respond as quickly as possible to identify target objects but must additionally register their judgments before a prescribed response deadline (see [5] Exp. 2). The response deadlines reported in the literature range between 450-ms and 550-ms after target onset. Effects reported in the literature take the form of an attenuated interaction on error rates in judgment. Participants mistake tools for guns more often following Black versus White primes. Erroneous identification of guns either does not differ by prime type or, in some cases, emerges as a full crossover interaction in which guns are more often mistaken for tools following White versus Black primes.
Reaction time analyses. Researchers have adopted varying upper and lower bounds for excluding reaction time outliers in the WIT paradigm (Table 1). Responses that are rendered too quickly are thought to reflect behavioral action slips and/or participant inattentiveness. Responses that are too slow can distort analyses and may also indicate participant inattentiveness. Researchers have used different strategies for handling these responses in the published literature. After log-transformation and exclusions, reaction time data aggregated at the participant level are submitted to repeated-measures ANOVA.
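As a concrete illustration, the trimming and log-transformation steps might look as follows. The 300-ms and 2000-ms bounds and the function name are placeholders, since published studies differ in the exact cutoffs they apply (Table 1):

```python
from math import log

def clean_rts(rts_ms, correct, lower=300, upper=2000):
    """Retain correct responses inside the outlier bounds, then log-transform.

    The bounds are placeholders -- published WIT studies differ in the
    cutoffs they apply (Table 1).  The log transform addresses the
    positive skew typical of reaction time distributions.
    """
    return [log(rt) for rt, ok in zip(rts_ms, correct)
            if ok and lower <= rt <= upper]

# Example: one too-fast trial, one error trial, and one too-slow
# trial are all excluded; only the 480-ms and 610-ms correct
# responses survive trimming.
cleaned = clean_rts([250, 480, 515, 2400, 610],
                    [True, True, False, True, True])
```

The cleaned, log-scaled values would then be averaged within each prime x target cell per participant before entering the repeated-measures ANOVA.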
Errors analyses. In comparison to reaction time analyses, relatively few exclusions have been reported in the extant Error paradigm literature. Those exclusions that have been implemented are done at the participant level to mitigate the influence of inattentive participants [12]. Researchers may also consider the possibility of excluding those participants who have low accuracy rates, who utilize a single key in responding, or who respond with a single key at rates well outside of group means. After data cleaning, error proportions are aggregated at the participant level for each of the four trial types and submitted to repeated-measures ANOVA.
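A minimal sketch of the aggregation step, assuming hypothetical trial-level records for a single participant:

```python
from collections import defaultdict

def cell_error_rates(records):
    """Aggregate trial-level errors into per-cell error proportions.

    `records` holds (prime, target, error) tuples for one participant;
    the four prime x target cells mirror the WIT's 2 x 2 design, and
    the resulting proportions are what enter the repeated-measures ANOVA.
    """
    errors, counts = defaultdict(int), defaultdict(int)
    for prime, target, error in records:
        errors[(prime, target)] += error
        counts[(prime, target)] += 1
    return {cell: errors[cell] / counts[cell] for cell in counts}

# Hypothetical trial records (prime, target, error flag):
rates = cell_error_rates([("black", "gun", 0), ("black", "tool", 1),
                          ("white", "gun", 0), ("white", "tool", 0),
                          ("black", "tool", 1), ("white", "gun", 1)])
```

Each participant contributes four such proportions, one per prime x target cell, to the within-subjects analysis.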

Assessing the evidential value of published literature
Before estimating an effect size for the WIT paradigm, it is important to consider and empirically test whether the reported literature likely contains evidential value and/or has been influenced by publication biases. To assess this, I first conducted a search of the literature for publications that reported data from the WIT. Studies were included in the analysis if they met two criteria. First, they had to use a sequential procedure such that prime images preceded target images (i.e., SOA > 0-ms). Second, they had to use Black and White faces as prime stimuli and weapons and non-weapons as target stimuli. Searches were conducted on PsycINFO, Web of Science, and Google Scholar with the following keywords: weapons task, weapon identification, and weapon AND Payne. Additional articles were obtained via inspection of all articles that cited Payne (2001).
One technique for assessing the potential influence of publication bias is the "p-curve" [13]. To conduct p-curve analyses, I aggregated F-statistics, their associated p-values, and ANOVA degrees of freedom from reported repeated-measures ANOVAs in the literature for both the RT and the Error paradigms. As an attenuated interaction is the primary prediction for both the RT and Error paradigms, the omnibus ANOVA interaction term was the statistic of interest [13]. The p-curve analysis plots the distribution of significant p-values (< .05) reported in the published literature. The shape of the distribution can then be used to infer whether there is evidential value in the published literature. A flat distribution indicates that the effect under consideration is likely "nonexistent". In contrast, a significantly right-skewed distribution indicates that the effect under consideration likely does exist. Finally, a significantly left-skewed distribution indicates that the literature has likely been shaped by intense p-hacking.
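The right-skew test at the heart of the p-curve can be sketched as a Stouffer-style aggregation of rescaled significant p-values. This is a simplified illustration of the logic described in [13], not the full published p-curve app:

```python
from math import sqrt
from statistics import NormalDist

def pcurve_right_skew_z(p_values, alpha=0.05):
    """Stouffer-style right-skew test over significant p-values.

    Under the null of no effect, significant p-values are uniform on
    (0, alpha); each is rescaled to a 'pp-value' (p / alpha) and mapped
    to a z-score.  A strongly negative aggregate Z indicates right skew
    and hence evidential value.  Simplified sketch of the logic in [13].
    """
    nd = NormalDist()
    sig = [p for p in p_values if p < alpha]
    zs = [nd.inv_cdf(p / alpha) for p in sig]
    return sum(zs) / sqrt(len(sig))

# A set dominated by very small p-values yields a negative (right-skew) Z:
z = pcurve_right_skew_z([0.001, 0.004, 0.010, 0.020, 0.030])
```

The Z statistics reported below for the RT and Error paradigms are of this form: large negative values indicate that small p-values predominate, as expected when a true effect is studied with adequate power.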

RT paradigm
To assess the evidential value of published experiments reporting RT effects, I aggregated 15 relevant interaction test statistics from the extant literature (Table 2). As shown in Fig 1, the p-curve analysis (v. 4.052) for the RT paradigm indicated that the distribution had a significant right skew, Z = -7.67, p < .0001. This suggests that the WIT effect is a) likely to exist, and b) unlikely to be biased by extensive p-hacking.

Error paradigm
I aggregated 33 interaction test statistics from the extant literature (Table 2). As shown in Fig 2, the p-curve analysis for the Error paradigm indicated that the curve had a significant right skew, Z = -15.85, p < .0001.

Estimating WIT effect size
Results from the p-curve analyses indicated that the extant literature a) likely contained evidential value, b) was not detectably biased by intense p-hacking, and c) appeared highly powered to detect the effect. These results suggest that the effect sizes reported in the literature would be informative in estimating the expected effect size. Thus, effect sizes were computed for each published experiment.

Researchers must decide which studies should be included in the estimate of each type of effect. As an example, Stewart and Payne [24] implemented an intervention intended to eliminate stereotypic biases and, therefore, the effect of interest. Notably, this intervention fell short of entirely eliminating the WIT effect, but it arguably should not be included in calculating an average expected effect size for close replications of the WIT that do not use this intervention. Likewise, some interventions sought to determine whether situational manipulations (e.g., alcohol) would increase WIT bias, and these arguably should not be included either. Thus, for the present effect size analyses, I first report estimations that include all available experimental data. In a subsequent analysis, I report estimations that only include experiments that I subjectively considered close replications of the paradigm (excluding those that sought to attenuate or exacerbate WIT effects). Examples of close replications include experiments with minor modifications (e.g., [20,32]) and those that used the WIT to document individual differences (e.g., [8,19]).

WIT effect sizes were estimated by fitting random-effects models with the 'metafor' package in the R statistical computing environment [46,47]. Each model accounted for the nesting of experimental data sets within reported studies. Additional robustness checks indicated that other plausible nestings of the data (e.g., by corresponding author) did not substantively impact the reported estimates.
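As an illustration of the random-effects logic, a basic DerSimonian-Laird estimator is sketched below. Unlike the multilevel 'metafor' models reported here, this simplified version ignores the nesting of data sets within studies, and the inputs are hypothetical:

```python
from math import sqrt

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate via DerSimonian-Laird.

    A simplified stand-in for the multilevel models fit with R's
    'metafor' package; it ignores the nesting of experimental data
    sets within studies that the reported models account for.
    """
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)  # between-study variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, sqrt(1 / sum(w_star)), tau2

# Hypothetical per-experiment effects and sampling variances:
pooled, se, tau2 = dersimonian_laird([0.1, 0.6, 0.4], [0.01, 0.01, 0.01])
```

A positive tau-squared, as in this toy example, corresponds to the detectable between-experiment heterogeneity reported in Tables 3 and 4.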
Results for both the RT and Error paradigms indicated that the interaction was reliable and that heterogeneity was detectable in each of the analyses (see Tables 3 and 4). Finally, funnel plot asymmetry tests did not detect bias in the RT literature (t(13) = -.118, p = .908), nor in the Error literature (t(31) = .087, p = .931). This comports with the conclusions from the p-curve analyses using a more traditional metric for assessing publication bias.
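The funnel plot asymmetry tests reported above are regression-based. A plain-OLS sketch of Egger's test, with hypothetical inputs, is:

```python
from math import sqrt

def egger_test(effects, ses):
    """Egger's regression test for funnel-plot asymmetry.

    Regress the standardized effect (effect / se) on precision (1 / se);
    an intercept far from zero suggests small-study (publication) bias.
    Returns the intercept and its t statistic (df = k - 2).
    """
    y = [e / s for e, s in zip(effects, ses)]
    x = [1 / s for s in ses]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)
    se_intercept = sqrt(s2 * (1 / n + mx ** 2 / sxx))
    return intercept, intercept / se_intercept

# Hypothetical effect sizes and standard errors:
intercept, t_stat = egger_test([2.1, 1.95, 6.2 / 3, 1.95],
                               [1.0, 0.5, 1 / 3, 0.25])
```

The t statistics in the text (e.g., t(13) = -.118 for the RT literature) are of this kind: values near zero indicate no detectable asymmetry between effect size and study precision.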

Power recommendations
Power estimates were calculated using G*Power v3.1 [48]. Recommendations for the number of participants are shown in Tables 5 and 6. There are several notes regarding their interpretation. First, these sample size estimates apply only to detecting the fully within-subjects interaction in each of the WIT paradigms. Studies investigating the impact of situational interventions very likely need to be powered at a much higher N than the present recommendations. Consider that even the most effective intervention to reduce the WIT effect fell short of eliminating it [24]: an observable interaction remained in the Error paradigm, albeit with an attenuated effect size (ω² = .045, partial η² = .063). In contrast, interventions emphasizing quick responding have produced WIT effects that were only directionally stronger than the estimated average effect size (ω² = .383, partial η² = .398). Thus, when powering experiments to investigate bias interventions specifically, expect effect sizes to range between ω² = .04 and ω² = .40. Given this range, many more participants per experimental level may be needed to investigate the impact of between-subjects interventions.
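For intuition about how such sample size recommendations arise, note that a 2 x 2 fully within-subjects interaction with 1 numerator df is equivalent to a one-sample t-test on the double-difference score. The sketch below uses a normal approximation to paired-test power; G*Power (used in the text) works with exact noncentral distributions, so its counts differ slightly, and the effect size dz here is a hypothetical standardized double-difference, not the ω² values above:

```python
from math import ceil
from statistics import NormalDist

def approx_n_for_power(dz, power=0.80, alpha=0.05):
    """Approximate N for a paired/within-subjects 1-df test.

    Normal approximation: solve Phi(dz*sqrt(N) - z_{1-alpha/2}) >= power.
    dz is the standardized double-difference score (an assumption here,
    not a quantity reported in the text).
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return ceil(((z_crit + z_power) / dz) ** 2)

# Illustrative medium-sized dz = .5:
n80 = approx_n_for_power(0.5, power=0.80)
n95 = approx_n_for_power(0.5, power=0.95)
```

Required N grows with the square of the inverse effect size, which is why the 95%-power columns in Tables 5 and 6 demand substantially more participants than the 80% columns.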
Scientists investigating statistical power in replication and research design have differing recommendations with respect to power in experimental work. For example, Simonsohn's [49] small telescopes approach recommends at least 2.5x original sample sizes when attempting to replicate previous experimental work. As another example, Lakens and Evers [50] have put forward sequential analysis techniques designed to control Type I error rates while conserving scarce data collection resources. These and many more approaches are available that seek to balance the precision of parameter estimation and the allocation of researcher resources. In this vein, it is important to point out that the present power recommendations assume that meta-analytic effect size estimates are not biased by selective reporting. Although p-curve and funnel plot analyses did not detect systematic bias, this does not necessarily indicate that no bias is present. Thus, it remains possible that WIT effect sizes are upwardly biased in the reported literature. Researchers can and should use these recommendations as a starting point, modifying their sampling plan as needed based on available resources, desire for certainty or precision in estimation, and as additional data becomes available.

Independent replication of Payne (2001)
To complement this analysis of the literature and the recommendations for power, I conducted an independent replication of the two experiments reported in Payne [5]. In doing this, I sought to prospectively test the power recommendations developed above.

Research design
The design of this replication can be considered "close" but not "exact." The differences between the replication and Payne [5] are as follows. First, I utilized previously validated racial prime stimuli that have not yet been investigated in the WIT literature [51]. The total set consisted of head-and-shoulders color photographs of 24 Black males and 24 White males. Second, I generated new target stimuli of both weapons and tools. These stimuli consisted of 5 guns and 5 tools. To reduce the possibility that participants could identify targets based on repeated presentation, I rotated each image by 90 degrees to produce 4 orientations, for a total set of 40 target stimuli. Finally, I implemented a third set of neutral control prime images that consisted of the outline of a face (see [52]). It is possible that each of these modifications might produce results that diverge from those of the original paradigm. However, any differences these modifications produce would be informative when considered on a conceptual level. If we observe WIT effects with a) new prime stimuli, b) new target stimuli, and c) a new class of prime stimuli, we can then have increased confidence that the theoretical claim, namely that priming race produces differences in the speed and accuracy of identifying guns and tools, generalizes beyond the simple specification implemented in the originally reported study [53]. If we do not observe WIT effects with these modifications, then theory must be constrained to reflect boundary conditions of the effect (e.g., "the effect does not occur with different target stimuli").
With the exception of the aforementioned differences, all other aspects of the procedure closely parallel those reported in Payne [5]. Participants completed the WIT protocol at individual cubicles in groups of 1-4. After providing informed consent, participants were informed that they were participating in a task investigating visual perception. After completing 18 practice trials, participants completed 216 critical test trials. On each trial, a visual fixation cross appeared for 500-ms. The fixation cross was replaced by a prime image presented for 200-ms. The prime was directly replaced by a target image presented for 200-ms. The target image was backward masked by a visual static image until a response was given. Participants received two self-paced breaks, one after each block of 72 critical trials. Finally, as a between-subjects manipulation, participants were assigned to complete the task with either a 500-ms response deadline or a 1000-ms response deadline. Whereas Payne [5] compared across two independent experiments, the between-subjects manipulation of the response deadline allows for comparisons between the two conditions within a single experiment. For responses registered beyond the deadline, participants saw the message "Please try to respond faster!" for 2 seconds.

Table 5. Recommendations for power in number of participants for RT paradigm (1-β = 80% and 95%) by inclusion criterion.

              1-β = .80     1-β = .95
All studies   28 [22, 40]   46 [34, 66]

Participants
Given my recommendations for power in the two paradigms, I sought to collect data from at least 40 participants each for the RT and Error paradigms. The data were not analyzed before the desired sample size was surpassed. Eighty-seven undergraduates from UC-Davis participated in exchange for partial course credit (92 percent female; 55 percent Asian, 23 percent Latino/a, 17 percent White, 2 percent Black, and 2 percent unidentified). All participants gave written consent to participate, and all study procedures were approved by the University of California, Davis Office of Research. A computer error resulted in uninterpretable data for 5 participants; thus, the final data set consisted of 42 participants in the 1000-ms condition (RT condition) and 40 participants in the 500-ms condition (Error condition). I set an a priori criterion to exclude participants who used a single key in responding to all trials, but no participants met this criterion. Full data are available from OSF at osf.io/9e6sa/.

RT analysis.
The analysis plan is identical to that reported in Payne [5]. Only accurate identifications were included, and reaction times less than 100-ms or greater than 1000-ms were trimmed from the analysis (4.62% of data for the 1000-ms condition; 20.56% for the 500-ms condition). A log transformation was applied to reduce positive skew in the resulting distribution [5]. Mean reaction times were aggregated for each trial type and subjected to a mixed-model ANOVA with response deadline (500-ms vs. 1000-ms) as a between-subjects factor and prime (Black vs. White) and target (gun vs. tool) as within-subjects factors.
Analysis of critical predictions. To enhance the direct comparability of the present replication with Payne [5], I analyze the critical prime x target interactions for the RT and Error paradigms separately. As described in the present literature, the RT effect should be most evident when the response deadline is longer (1000-ms) compared to when it is shorter (500-ms). In contrast, the Error effect should be most evident when the response deadline is shorter compared to when it is longer.
When the long response deadline was imposed, the expected prime x target interaction on reaction times was observed, F(1,41) = 35.395, p < .001, partial η² = .463 (see Fig 3). Note that this effect is stronger than expected and falls outside the confidence interval of the meta-analytic estimate. Both simple effects were detectable. Guns were identified more quickly following Black versus White primes, t(41) = 4.441, p < .001, CI difference [.040, .106], and tools were identified more quickly following White versus Black primes, t(41) = 4.372, p < .001, CI difference [.035, .094]. As discussed above, the effect was not statistically moderated by the response deadline factor.
When the 500-ms response deadline was imposed, the expected prime x target interaction on error rates was observed, F(1,39) = 7.939, p = .008, partial η² = .169 (see Fig 4). Note that this effect falls within the confidence interval of the effect size estimated in the meta-analysis. Both simple effects were detected. Tools were more often misidentified following Black versus White primes, t(39) = 2.232, p = .031, CI difference [.008, .158], and guns were more often misidentified following White versus Black primes, t(39) = 2.819, p = .008, CI difference [.021, .123]. As described above, the effect was moderated by response deadline in the predicted direction: stereotype-congruent errors were more common when the response deadline was 500-ms than when it was 1000-ms.

Discussion
This brief review evaluated the reported literature investigating the Weapons Identification Task, a commonly used sequential priming task. The review indicated that 1) there are differences in the implementation and analysis of data in the paradigm and that 2) the published literature investigating the WIT paradigm very likely contains evidential value despite these differences and is not substantially impacted by publication bias. Given the favorable results of the publication bias analysis, I used effects reported in the extant literature to generate estimates of effect sizes for both the RT and Error WIT paradigms. Using the estimated effect sizes, I then generated recommendations for adequate power in each paradigm. The appropriateness of this strategy is contingent on the assumption that publication bias has not contaminated the literature. In many cases, this assumption may prove problematic; however, the p-curve analyses supported this assumption in the present analysis.

Finally, I tested the efficacy of these recommendations prospectively by conducting a close independent replication of both the RT and Error paradigms. Both the RT and the Error interactions emerged in this prospective test, supporting the published literature and the proposed recommendations for power. Notably, the size of the RT effect was stronger than expected given the meta-analytic findings; the size of the Error effect fell within the confidence interval of the meta-analysis. These recommendations are not static and should be flexibly revised as additional data becomes available.
There are several limitations of the present work that should be noted. First, the reported meta-analytic estimates depend on the assumption that selective reporting has not biased published WIT effects. Tests of this assumption found no evidence of bias in the literature. However, there are relatively few significant interaction test statistics in the RT literature (k = 9) and therefore less power to detect bias. Simonsohn et al. [13] report simulations of this case suggesting that k = 10 is sensitive enough to find evidence that a set of studies lacks evidential value, especially when the literature appears to contain highly powered studies (as appears to be the case with the WIT literature). Even if the reported literature is found to contain evidential value, we cannot be certain that effects in the reported literature are not upwardly biased. This possibility should be assessed as additional evidence investigating the WIT is collected and as the p-curve technique is further probed and refined.

It is also important to explicitly acknowledge the strengths and limitations of the current experimental replication. Although the implementation of the WIT paradigm and the corresponding data analytic techniques are relatively constrained, this does not rule out the possibility that researcher decisions can influence (even unconsciously) the interpretation of results [14]. So that others may independently evaluate the strength of the replication evidence, I would like to reiterate that the sample size, task implementation (e.g., stimuli), and data analytic strategies were decided prior to data collection. Additionally, the data were not examined at any intermediate point prior to the critical analyses. However, a limitation is that these plans were not preregistered on a public site. Preregistration is considered by some to be a powerful mechanism for increasing confidence in published results [54].
It is therefore appropriate for a skeptical reader to consider this when evaluating the current replication results.

Author Contributions
Conceptualization: AMR.

14. Gelman A, Loken E (Department of Statistics, Columbia University, New York, NY). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time.