The Validity of Conscientiousness Is Overestimated in the Prediction of Job Performance

Abstract

Introduction: Sensitivity analyses refer to investigations of the degree to which the results of a meta-analysis remain stable when conditions of the data or the analysis change. To the extent that results remain stable, one can refer to them as robust. Sensitivity analyses are rarely conducted in the organizational science literature. Despite conscientiousness being a valued predictor in employment selection, sensitivity analyses have not been conducted with respect to meta-analytic estimates of the correlation (i.e., validity) between conscientiousness and job performance.

Methods: To address this deficiency, we reanalyzed the largest collection of conscientiousness validity data in the personnel selection literature and conducted a variety of sensitivity analyses.

Results: Publication bias analyses demonstrated that the validity of conscientiousness is moderately overestimated (by around 30%; a correlation difference of about .06). The misestimation of the validity appears to be due primarily to the suppression of small effect sizes in the journal literature. These inflated validity estimates result in an overestimate of the dollar utility of personnel selection by millions of dollars and should be of considerable concern for organizations.

Conclusion: The fields of management and applied psychology seldom conduct sensitivity analyses. Through the use of sensitivity analyses, this paper documents that the existing literature overestimates the validity of conscientiousness in the prediction of job performance. Our data show that effect sizes from journal articles are largely responsible for this overestimation.



Introduction
Meta-analytic findings are viewed as a primary means for generating cumulative knowledge and bridging the often lamented gap between research and practice [1][2][3][4]. However, concerns regarding meta-analytic results and our cumulative knowledge remain [5][6][7][8][9][10]. Sensitivity analyses address the degree to which the results of a meta-analysis remain stable when conditions of the data or the analysis change [11]. To the extent that results remain stable, they can be considered robust. Unfortunately, the vast majority of meta-analyses in the organizational sciences fail to conduct sensitivity analyses and do not report the robustness of the meta-analytic findings [12] despite the fact that scientific organizations such as the American Psychological Association [13,14] and the Cochrane Collaboration [15] require or recommend such analyses.
Sensitivity analyses in meta-analytic studies include publication bias and outlier analyses. Publication bias occurs when the research findings that are available on a particular relation are not representative of all research findings on that relation [6,16]. Although publication bias analyses are rare in the organizational sciences, such analyses are much more common in other disciplines. For example, van Lent, Overbeke, and Out examined the role of review processes in the publication of drug trials in medical journals [17]. Kicinski examined publication bias in several meta-analyses in four major medical journals [18]. Publication bias has also been addressed in animal research [19,20]. In both the medical sciences [21][22][23] and the social sciences [24], publication bias appears to be primarily driven by authors who do not submit null or otherwise undesirable findings [16,25]. These authors are likely responding to journal policies that discourage the publication of research with non-significant findings as well as replications that can enable the evaluation of the credibility of previous research findings [10,26]. In addition to publication bias, outliers can have a noticeable effect on meta-analytic results [27,28]. Unfortunately, although outlier analyses are a type of sensitivity analysis [11], only around 3% of all meta-analyses in the organizational sciences report assessments of outliers [29].
Our analysis addresses the personality trait conscientiousness, one of the "Big 5," a term that refers to five broad dimensions that succinctly describe human personality [30]. Shaffer and Postlethwaite [31] conducted the most comprehensive meta-analysis to date in which they assessed the correlation (i.e., validity) between conscientiousness and job performance (k = 113). Of the Big 5, conscientiousness was found to have the largest magnitude validity (the observed validity range for conscientiousness was .13 to .20 [31]). The authors found that the other Big 5 personality traits had observed mean validities that were less meaningful from a practical perspective in a selection context (i.e., where job performance is the criterion). A concern with the Shaffer and Postlethwaite study is that they concluded that the validity estimates for conscientiousness are not affected by publication bias [31], yet they did not perform any sensitivity analysis. This paper applies sensitivity analyses, specifically publication bias and outlier analyses, to evaluate the robustness of their conclusions. To facilitate this task, we replicated their approach and crossed the frame-of-reference variable with all other moderators, which allowed us to reduce moderator-induced heterogeneity and to assess whether the influence of outliers and/or publication bias varied across sub-distributions. Our sensitivity analyses followed the guidance in [11] and Kepes et al. [33]. Given that CMA (Comprehensive Meta-Analysis) is based on the Hedges and Olkin [34] tradition of meta-analysis, our results differed slightly from the psychometric meta-analysis method [35] used by Shaffer and Postlethwaite [36]. We note that the reliability coefficients of the personality scales in this data set are between .79 and .87 (based on the coefficients from the data set for measures that reported at least three reliability coefficients).
The trim and fill method can perform poorly when effect sizes are heterogeneous (i.e., when their variability exceeds what sampling error alone would produce) [40,51,52]. Thus, in our analyses, the credibility of the trim and fill results is strongest in those sub-distributions in which we control for moderators (and thus control to some degree for heterogeneity) [33]. Regarding the influence of heterogeneity on PET-PEESE, Moreno et al. conducted a comprehensive simulation study that included variants of Egger's test of the intercept [53]. Two of these variants (fixed-effects model and fixed-effects variance model) correspond to the two components of PET-PEESE. They concluded that these variants can be inappropriate in very heterogeneous settings [53]. Similarly, as noted in the description of p-uniform [47], this method overestimates the mean effect as heterogeneity increases. To the extent that our data set is heterogeneous, p-uniform, and perhaps also PET-PEESE, could be inappropriate. However, because both methods are relatively new, we argue that it is informative to apply them to our data set to see the extent to which their results converge with the results of the other, more established methods.
Finally, we note that some analyses use Fisher z transformed Pearson correlation coefficients (r). The transformation is used in some statistical methods, in part, because it makes the sampling distribution symmetrical. Given the relatively small magnitude of our correlations, the Fisher z coefficients and the untransformed correlation coefficients were nearly identical. Still, in the interest of making our analyses clear and our results fully replicable, we detail which statistical methods performed their calculations on Fisher z transformed correlations; all such methods back-transformed the results into untransformed correlation coefficients. The meta-analyses that yielded the random-effects mean, the confidence interval for the mean, the Q test, the I² statistic, and the τ estimate were conducted with CMA, which uses Fisher z coefficients in its calculations. Although CMA did not provide the prediction interval, we calculated it using output from CMA, which is likewise based on Fisher z coefficients. The one sample removed analyses, the trim and fill analyses, the selection models, and p-uniform were also based on Fisher z coefficients. The PET-PEESE and outlier analyses were conducted on untransformed correlation coefficients. We emphasize that for all results, the reported coefficients are in the metric of untransformed correlation coefficients and thus can be compared.
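For concreteness, the transformation and its back-transformation can be sketched in a few lines (a minimal illustration; the analyses themselves were run in CMA and related tools):

```python
import math

def fisher_z(r):
    """Fisher z transformation of a correlation; its sampling distribution
    is approximately normal and symmetric with variance 1/(n - 3)."""
    return math.atanh(r)  # = 0.5 * ln((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Back-transform a Fisher z value to the correlation metric."""
    return math.tanh(z)

# For the small correlations in this literature, z and r are nearly
# identical, as noted above.
z = fisher_z(0.16)
print(round(z, 4))                    # 0.1614
print(round(inverse_fisher_z(z), 4))  # 0.16
```

The near-identity of z and r at these magnitudes is why the choice of metric has little practical effect on the reported means.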
Decision rules for determining the range of the mean estimates and the magnitude of bias
We relied on the decision rules offered in Kepes et al. [33] in determining the range of mean validity estimates and the magnitude of publication bias. These decision rules are summarized here. First, we estimated the observed validity as the RE meta-analytic mean (r o RE). Next, we performed several sensitivity analyses, including the one sample removed analysis (osr), the trim and fill analysis (t&f r o), and selection models with moderate (sm m r o) and severe (sm s r o) assumptions of publication bias to derive additional mean validity estimates. We also conducted P-TES, PET-PEESE, and p-uniform analyses. We defined the highest validity estimate as the highest value from any analysis that provided an adjusted effect size estimate (r o RE, osr, r o FE, t&f r o, sm m r o, sm s r o, and PET-PEESE) [33]. We excluded the results from p-uniform due to their lack of convergence with the results from the other, more established methods; this is likely due to the heterogeneity of our data [47].
We defined the lowest validity estimate as the smallest value from any of these seven analyses (i.e., r o RE , osr, r o FE , t&f r o , sm m r o , sm s r o , and PET-PEESE). We defined the baseline range estimate (BRE) as the absolute difference between r o RE and the validity estimate farthest away (either the lowest or highest value). We defined the maximum range estimate (MRE) as the absolute difference between the lowest and the highest value. When calculating the relative difference of the range estimates, we used r o RE , the potentially best mean estimate, as the base (i.e., as 100%). Consistent with Kepes et al., we characterized the magnitude of publication bias as negligible if the relative range (BRE or MRE) was smaller than 20%, as moderate if the relative range (BRE or MRE) was between 20% and 40%, and as large if the relative range (BRE or MRE) was larger than 40%. For the P-TES estimates, we used the decision rules from Francis [44] to determine whether the data were suspect (i.e., a probability of .1 or less is consistent with an inference that the data should be viewed with skepticism). We find the decision rules from Kepes and colleagues reasonable, and note that other researchers have used them [54], and, to date, no critiques of them have been offered. However, readers may choose to adopt other decision rules. We provide the data and results needed to assist the reader in such an effort.
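These decision rules are mechanical enough to express in a few lines. The sketch below uses hypothetical estimate values, not those of our analyses, and the dictionary keys are illustrative names of our choosing:

```python
def classify_publication_bias(estimates, base_key="re_mean"):
    """Apply the Kepes et al. [33] decision rules summarized above:
    compute the baseline range estimate (BRE) and maximum range estimate
    (MRE), express each relative to the RE meta-analytic mean (the base,
    i.e., 100%), and label the bias negligible (< 20%), moderate (20% to
    40%), or large (> 40%)."""
    base = estimates[base_key]
    values = estimates.values()
    lo, hi = min(values), max(values)
    bre = max(abs(base - lo), abs(base - hi))  # distance to farthest estimate
    mre = hi - lo                              # spread between the extremes

    def label(rng):
        pct = 100 * rng / base
        if pct < 20:
            return "negligible"
        elif pct <= 40:
            return "moderate"
        return "large"

    return {"BRE": bre, "BRE_label": label(bre),
            "MRE": mre, "MRE_label": label(mre)}

# Hypothetical estimates in the spirit of Table 1 (not the actual values):
result = classify_publication_bias({
    "re_mean": 0.16, "trim_and_fill": 0.13,
    "selection_moderate": 0.15, "selection_severe": 0.12,
    "pet_peese": 0.13})
```

With these illustrative inputs, both range estimates equal .04, or 25% of the base of .16, which the rules classify as moderate.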

Results
Using the approach detailed by Viechtbauer and Cheung [48] and the diagnostics and criteria for determining whether a particular study is an outlier described by Viechtbauer [49], we identified one outlier (the correlation coefficient from Lao [55]). We verified that the study was correctly coded (see [55], p. 32, Table 1). The sample is composed of police officers ("State Troopers"). Other research has found lower than typical prediction of law enforcement job performance from measures of general cognitive ability and employment interviews [33,56]. Hirsh and colleagues speculated that the lower magnitude correlations may be due to the supervisor having limited opportunity to observe the work of the police officer [56]. Police officers typically patrol alone in their police car out of the view of their supervisor. Our results by sub-distribution are presented in Table 1 for all primary samples; S1 Table contains the results without the one identified outlier. Table 1 contains the results of the conscientiousness analyses conducted for the full distribution; publication bias results are offered for all sub-distributions with at least 10 correlations. The first two columns in Table 1 show the distribution analyzed and the number of samples (k) in the distribution. Columns three through nine display the results from the meta-analytic RE model: the mean observed correlation (r o RE), the associated 95% confidence interval (95% CI), the associated 90% prediction interval (90% PI), the Q statistic, I², τ, and the one-sample removed analysis (minimum, maximum, and median mean validity estimates).
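For readers who wish to reproduce quantities of this kind, the RE-model statistics just listed can be computed from a set of correlations and sample sizes roughly as follows (a DerSimonian-Laird sketch on hypothetical data; CMA's exact implementation may differ in detail):

```python
import math

def random_effects_meta(rs, ns):
    """DerSimonian-Laird random-effects meta-analysis of correlations,
    computed on Fisher z values (as CMA does) and back-transformed.
    Returns the RE mean r, Q, I^2 (in percent), and tau (z metric)."""
    zs = [math.atanh(r) for r in rs]
    ws = [n - 3 for n in ns]           # inverse variances: var(z) = 1/(n - 3)
    sw = sum(ws)
    z_fe = sum(w * z for w, z in zip(ws, zs)) / sw
    q = sum(w * (z - z_fe) ** 2 for w, z in zip(ws, zs))
    df = len(rs) - 1
    c = sw - sum(w * w for w in ws) / sw
    tau2 = max(0.0, (q - df) / c)      # between-sample variance estimate
    i2 = max(0.0, 100 * (q - df) / q) if q > 0 else 0.0
    ws_re = [1.0 / (1.0 / w + tau2) for w in ws]
    z_re = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    return math.tanh(z_re), q, i2, math.sqrt(tau2)

# Hypothetical correlations and sample sizes (not the actual data set):
mean_r, q, i2, tau = random_effects_meta(
    [0.35, 0.05, 0.22, 0.10, 0.28], [60, 100, 150, 300, 200])
```

With these illustrative inputs, I² lands near 50%, the level of heterogeneity we report for many of our distributions.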
The next four columns (10 through 13) contain the results from the trim and fill analysis, including the side of the funnel plot where the samples were imputed (FPS; a left-hand side imputation is consistent with an inference of publication bias resulting from the suppression of small magnitude effect sizes [33,40]), the number of imputed samples (ik), the trim and fill adjusted observed mean correlation (t&f r o), and the trim and fill adjusted 95% confidence interval (t&f 95% CI). Columns 14 and 15 display the results from the moderate and severe selection models, including their respective adjusted observed estimates for instances of moderate and severe publication bias (sm m r o and sm s r o) and their respective variance components. Column 16 provides the probability for the test of excess significance (P-TES). We report the probability of the chi-square test as the P-TES value and note that this is a probability of excess significance and is not an effect size. The next two columns, columns 17 and 18, display the PET (precision-effect test) and PEESE (precision effect estimate with standard error) adjusted observed mean estimates (i.e., PET r o and PEESE r o, respectively; the PET r o column also includes its associated one-tailed p-value, which is used to determine whether the PET r o or the PEESE r o is the adjusted observed mean for the meta-analytic distribution [45]). The final column contains the p-uniform adjusted estimate of the mean effect size and its 95% confidence interval (p-uniform [95% CI]). We note that the P-TES values changed substantially in a few distributions when the sole outlier was dropped, suggesting that the outlier substantially influenced the P-TES value. These differences can be examined by comparing Table 1 with S1 Table. For example, for the non-journal article sub-distribution of effect sizes, the P-TES value including the outlier was .53, but .95 with the outlier dropped.
For the non-journal articles which used a non-contextualized measure, the P-TES was .65 when including the outlier and .76 without the outlier. When the purpose of the measure was classified as general purpose, P-TES was .32 with the outlier and .84 without it. When the research design was concurrent, the P-TES value including the outlier was .12 and thus approached a value (.10) at which one might draw an inference of a non-credible data set. However, the P-TES rose to .49 when the outlier was removed from the data set. Based on these results, when P-TES is used as a sensitivity analysis in a meta-analysis, we recommend that it be conducted with and without outliers to determine the robustness of the results. Using the typical criterion of .10 or less [44], neither the full distribution nor the sub-distributions were judged to be non-credible sets of data. Concerning the results reported in Table 1, we found varying degrees of robustness in the meta-analytic mean (i.e., validity) estimates for conscientiousness.
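The logic behind these P-TES values can be sketched as follows. This is a simplified illustration of the Ioannidis-Trikalinos excess-significance test on hypothetical data, not the implementation behind Table 1:

```python
import math
from statistics import NormalDist

def p_tes(rs, ns, alpha=0.05):
    """Test of excess significance: compare the observed count of
    statistically significant correlations (O) with the count expected (E)
    from each sample's power to detect the pooled effect, using a
    one-degree-of-freedom chi-square test.  The returned value is a
    probability, not an effect size."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # two-tailed cutoff (~1.96)
    ws = [n - 3 for n in ns]
    pooled = sum(w * math.atanh(r) for w, r in zip(ws, rs)) / sum(ws)
    observed = sum(1 for r, n in zip(rs, ns)
                   if abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit)
    # power of each sample: P(|z statistic| > z_crit | true effect = pooled)
    expected = sum((1 - nd.cdf(z_crit - pooled * math.sqrt(n - 3)))
                   + nd.cdf(-z_crit - pooled * math.sqrt(n - 3)) for n in ns)
    k = len(rs)
    chi2 = (observed - expected) ** 2 / (expected * (1 - expected / k))
    return 2 * (1 - nd.cdf(math.sqrt(chi2)))    # chi-square(1) p-value

# Hypothetical correlations and sample sizes (not the actual data set):
p = p_tes([0.35, 0.05, 0.22, 0.10, 0.28], [60, 100, 150, 300, 200])
```

A small p (.10 or less, per Francis [44]) indicates more significant results than the studies' power can plausibly explain, and hence a suspect data set.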
For the entire distribution (k = 113), the RE meta-analytic mean estimate (.16) was robust to the one sample removed analyses (i.e., the mean estimate did not change). However, the 90% prediction interval, which indicates the likely range of "true" effect sizes, is relatively wide (.03, .29). Furthermore, the RE meta-analytic mean estimate was not robust to all publication bias analyses. Specifically, the trim and fill estimate of .13 and the severe selection model estimate of .12 were noticeably smaller in magnitude than the RE estimate. Confirming these results, the PET-PEESE estimate was .13 (because PET was significant, the PEESE adjusted mean estimate was selected [45]). The PET test (.09, p < .001) supports the results from the trim and fill analysis by indicating that the effect size distribution is asymmetric, that is, that small magnitude effect sizes are likely to be missing from the meta-analytic distribution.
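The PET-PEESE adjustment just described can be sketched as follows (a minimal precision-weighted regression on hypothetical data; the helper names and values are ours, not from the original analyses):

```python
import math

def _wls(xs, ys, ws):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(ws)
    xb = sum(w * x for w, x in zip(ws, xs)) / sw
    yb = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xb) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xb) * (y - yb) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return yb - b1 * xb, b1

def pet_peese(rs, ns):
    """PET and PEESE on untransformed correlations: precision-weighted
    regressions of each effect size on its standard error (PET) or
    sampling variance (PEESE).  Each intercept estimates the effect an
    infinitely precise study would observe; in practice the PEESE
    intercept is reported only when the PET intercept is significant [45],
    a test omitted from this sketch."""
    ses = [math.sqrt((1 - r * r) ** 2 / (n - 1)) for r, n in zip(rs, ns)]
    ws = [1.0 / se ** 2 for se in ses]
    pet_b0, _ = _wls(ses, rs, ws)
    peese_b0, _ = _wls([se ** 2 for se in ses], rs, ws)
    return pet_b0, peese_b0

# Hypothetical data in which smaller samples report larger correlations
# (the funnel-asymmetry pattern discussed in the text):
pet, peese = pet_peese([0.25, 0.22, 0.18, 0.14, 0.12], [50, 80, 150, 400, 900])
```

With asymmetric inputs like these, both intercepts fall below the weighted mean, with PET adjusting more aggressively than PEESE.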
The contour-enhanced funnel plot (see Fig 1a) shows that all but one of the 23 imputed samples were in the area of statistical insignificance, which is consistent with an inference of publication bias stemming from the suppression of small magnitude correlations [33,50]. The forest plot for the cumulative meta-analysis by precision shown in Fig 2a suggests that the cumulative meta-analytic mean drifts upward (through the final estimate at k cum = 113) with the addition of even smaller samples. This is consistent with an inference of publication bias resulting from the suppression of small magnitude correlations (from small samples). These patterns, especially the one from the contour-enhanced funnel plot, are also inconsistent with the notion that small sample bias (i.e., small sample studies show systematic differences from larger sample studies due to assessing different populations or having measures of different sensitivity) is the cause of the observed results [33,50]. We conclude that publication bias has likely affected the observed mean validity of conscientiousness for predicting job performance such that it is likely to be smaller in magnitude than the RE meta-analytic mean of .16. We note that most of the bias stems from journal articles (see Table 1 as well as Figs 1 and 2), which is consistent with an inference of the suppression of statistically non-significant results. Thus, it is the literature published in journals that is largely responsible for distorting the research on the validity of conscientiousness.
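The cumulative meta-analysis by precision can be sketched as follows (a fixed-effect Fisher z version on hypothetical data; the CMA implementation behind Fig 2 may differ in detail):

```python
import math

def cumulative_means_by_precision(rs, ns):
    """Enter samples from largest n to smallest and recompute the
    (fixed-effect, Fisher z metric) mean correlation at each step.
    Upward drift as smaller samples enter is the pattern discussed in
    the text."""
    order = sorted(range(len(ns)), key=lambda i: ns[i], reverse=True)
    sw = sz = 0.0
    means = []
    for i in order:
        w = ns[i] - 3                    # inverse variance of Fisher z
        sw += w
        sz += w * math.atanh(rs[i])
        means.append(math.tanh(sz / sw))
    return means

# Hypothetical data: smaller samples report larger correlations, so the
# running mean drifts upward as they are added.
drift = cumulative_means_by_precision(
    [0.25, 0.22, 0.18, 0.14, 0.12], [50, 80, 150, 400, 900])
```

The first entry equals the most precise study's correlation; the last equals the full fixed-effect mean, so comparing the two shows the direction and size of the drift.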
The results from the analyses without the outlier were similar. Therefore, we do not discuss the analyses or results without the one outlier. However, the results for all analyses without the outlier are provided in the supplementary materials (see S1 Table).
Our findings, including the range estimates (BRE and MRE) and conclusions, are summarized in Table 2 (S2 Table contains the conclusions for the sub-distributions without the sole outlier). We note that the range estimates are not necessarily perfectly comparable if the severe selection model did not provide a sensible solution (indicated by n/a in Table 1; see [41]). For these distributions, the observed range estimates may be smaller when compared to distributions where the full range of estimates is available. In addition, the results of the p-uniform analyses did not converge well with the results from the other, more established methods, most likely because of the heterogeneity in the data. The article that introduced p-uniform [47] provided simulation evidence that it noticeably overestimates the effect size as heterogeneity increases, and our I² values are typically near 50, indicating non-trivial heterogeneity. Correspondence with one of the authors of the article introducing the p-uniform method, while informative, did not result in a decision rule concerning the magnitude of I² values at which p-uniform should not be used [57]. Because of this non-convergence and our substantial uncertainty about the appropriateness of the p-uniform approach for these data, we excluded the p-uniform results from our conclusions and Table 2 (and S4 Table; for conclusions of the results with the sole outlier that include the results from the p-uniform analysis, see S3 Table).
Based on the sum of evidence, we conclude that the conscientiousness data are not meaningfully influenced by a sole outlier. We also found that, in general, the data on conscientiousness are noticeably affected by publication bias. Thus, the apparent suppression of small magnitude effect sizes, which the contour-enhanced funnel plots indicated to lie predominantly in the area of statistical insignificance, likely has led to the overestimation of the validity of conscientiousness. The results for the sub-group distributions of samples from journal articles (k = 67) and non-journal sources (k = 46) support this notion because samples published in journal articles reported larger average effect size estimates (r o RE = .19) than samples from non-journal sources (r o RE = .12; see Table 1). Distributions involving journal articles also tended to be the most non-robust, typically with differences of at least .10 and overestimations of more than 60% (see Table 2). For illustrative purposes, we also provide the contour-enhanced funnel plots for both of these distributions as well as the forest plots from the respective cumulative meta-analyses by precision (see Fig 1b and 1c as well as Fig 2b and 2c). The contour-enhanced funnel plots and the cumulative meta-analyses by precision support an inference of publication bias and an overestimation of the mean validity for data from journal articles as well [33]. By contrast, the data from non-journal sources seem to be relatively robust to publication bias (see Table 2). Thus, it is the data from journal articles that are largely responsible for distorting the research on the validity of conscientiousness.
In addition, we found that the RE mean validity estimates for distributions involving contextualized measures of conscientiousness were sometimes more robust than the mean estimates for distributions involving non-contextualized measures. For the distribution of all non-contextualized measures of conscientiousness (k = 91), the 90% prediction interval ranged from .00 to .29. By contrast, the prediction interval for the distribution of contextualized measures (k = 22) ranged only from .16 to .22. However, for many other distributions, the contextualization of conscientiousness measures did not matter; contextualized and non-contextualized sub-distributions were often non-robust to a similar, moderate or even large, degree (see Table 2).
Although one may argue that the absolute difference between the RE meta-analytic mean estimates and the publication bias adjusted mean estimates tend to be rather small in magnitude (i.e., approximately .06 for most distributions), the relative differences tend to be noticeable (i.e., typically greater than 30%) and may be interpreted as moderate in size [6,33]. Furthermore, for data from journal articles, the overestimation appears to be large, for contextualized as well as non-contextualized measures of conscientiousness (see Tables 1 and 2).
Based on a reviewer request, statistical significance tests are provided in Table 3 for the moderator subgroups analyzed in Table 1. Results in S4 Table are for the data set with the sole outlier removed. In the contour-enhanced funnel plots (Fig 1), the darker shaded area contains correlations that may be described as marginally significant (p-values ranging from .05 to .10), and the lighter gray area contains correlations that are statistically significant (p < .05). Note that most of the imputed correlations are found in the data distribution drawn from studies published in journals; relatively few of the imputed correlations are found in the data distribution drawn from unpublished studies. This fact is consistent with an inference that publication bias in the full data distribution is largely due to the suppression of statistically insignificant correlations in journal-published articles. Thus, it is the journal articles that are largely responsible for distorting the research on the validity of conscientiousness.

Discussion
Publication bias and outliers can distort meta-analytic results and conclusions [3,16,27,28,33]. Unfortunately, most meta-analyses in the organizational sciences fail to conduct sensitivity analyses to assess the effect of these phenomena and do not report results regarding the robustness of meta-analytic findings [12,33], even though the American Psychological Association [13,14] and other scientific organizations (e.g., the Cochrane Collaboration [15]) recommend such analyses [36]. We note that even a journal published by the American Psychological Association (i.e., the Journal of Applied Psychology) seldom reports sensitivity analyses in its meta-analytic studies despite the recommendation of the organization that owns the journal.

Fig 2 displays forest plots for the cumulative meta-analyses by precision for the validity of conscientiousness (i.e., the correlation between conscientiousness and job performance). To obtain the plots, validities were sorted from largest sample size to smallest sample size and entered into the meta-analysis one at a time in an iterative manner. The lines around the plotted means are the 95% confidence intervals for the meta-analytic means. For panels A and B, the mean validities drift from smaller to larger as correlations from smaller and smaller sample size studies are added to the distribution being analyzed. For panel C, no noticeable drift is observed. The drifts from smaller to larger meta-analytic means are consistent with an inference of statistically insignificant correlations from smaller sample size studies being suppressed (i.e., publication bias). The lack of meaningful drift in panel C suggests that the data suppression is largely in the journal-published articles (see panel B). Thus, it is the data published in journal articles that are largely responsible for distorting the research on the validity of conscientiousness.
In this study, we used a variety of sensitivity analyses to assess the robustness of claims regarding the validity of conscientiousness for predicting job performance.
Overall, we conclude that the observed validity for conscientiousness is overestimated in the literature. This overestimation is primarily due to the influence of publication bias and not outliers. However, the lack of a distorting effect due to outliers may not be true for other literature areas and, in accordance with best meta-analytic practices [36], we encourage the use of outlier analyses in all meta-analytic reviews. We note that some sub-distributions were less robust than others (see Table 2). The non-contextualized sub-distribution (k = 91) misestimated the validity of conscientiousness to a large degree (40% to 47%; see Table 2). By contrast, effect sizes drawn from studies with contextualized conscientiousness measures (k = 22) seemed to be relatively robust and free of publication bias. However, such differences in the degree of robustness between non-contextualized and contextualized measures of conscientiousness were not always evident. Data from journals (k = 67) misestimated the validity of conscientiousness by a large degree (63%). Data from non-journal sources (k = 46) typically showed negligible to moderate degrees of publication bias, indicating that most of the apparent data suppression is associated with journal articles.
In addition to these findings, misestimation was evident with general purpose measures (k = 76) and was judged relatively large (43% to 50%). On the other hand, data drawn from studies that used a workplace purpose measure of conscientiousness were only negligibly or moderately affected by publication bias (k = 37; 16% to 21% misestimation), particularly if they involved contextualized measures (5% to 10%) as opposed to non-contextualized measures (53% to 58%). Misestimation was moderate for incumbent samples (k = 109; 31%). The degree of contextualization did not matter as both incumbents' sub-distributions (incumbents and non-contextualized measures; incumbents and contextualized measures) were non-robust to a large degree (e.g., their meta-analytic means were misestimated by up to potentially over 40%; see Table 2). The misestimation of concurrent designs (k = 105) was judged moderate (27% to 31%). As with the incumbent samples, the degree of contextualization did not matter as the meta-analytic means of both distributions were overestimated by up to a large degree (misestimation for concurrent designs and non-contextualized samples [k = 86]: .06 [40%]; misestimation for concurrent designs and contextualized samples [k = 19]: .07 [39%] to .08 [44%]). Finally, there were sufficient data for three specific measures. Validities based on the NEO Personality Inventory [58] showed large misestimation (43%) whereas there was negligible to moderate (17% to 21%) publication bias in validities drawn from the Personal Characteristics Inventory (PCI [59]) and negligible bias (5%) involving the Personal Style Inventory (PSI [60]).
Based on our findings, we conclude that the validity estimates from non-journal sources are likely to be more robust than estimates from journal articles. Although the goal of our paper was not a critique of Shaffer and Postlethwaite's study [31], the presence of potentially severe publication bias in samples from journal articles indicates that Shaffer and Postlethwaite overestimated the validity of conscientiousness measures, especially for non-contextualized measures of conscientiousness but also for contextualized ones. With regard to particular measures, the mean validity estimate for the NEO was comparatively low (r o RE = .14 vs. .24 and .22 for the PCI and PSI measures, respectively) and non-robust (the estimate from the severe selection model is .08, suggesting that the already comparatively low RE mean estimate for the NEO was overestimated by 43%). By contrast, the PCI and PSI measures seem to have validity estimates that are relatively robust and larger in magnitude.
It is possible that the NEO shows more publication bias than other measures because the NEO is a commercial employment test product, while at least some of the other measures in the analysis are not commercially sold. McDaniel, Rothstein, and Whetzel drew inferences consistent with a conclusion of publication bias when examining several commercially sold employment tests [61]. They speculated that results that may damage the marketing of commercial products might be suppressed. Their analysis of the validity distribution of "Test Vendor A" [61] evaluated potential publication bias in the PCI and, consistent with the current paper's results, found no compelling evidence of publication bias.

Recommendations and limitations
There are several possible critiques of this research, two of which are described below. First, we note that some might argue that differences expressed in percentages might be better expressed as differences in correlation magnitude. Our tables present both. Second, some might argue that a correlation inflation of some magnitude (e.g., .05) due to publication bias (and/or outliers) is a small difference and not likely to be meaningful. In the context of predicting job performance, one approach to assess the meaningfulness is to calculate the dollar value of differences in validity.
Using the mean validity estimate for data from journal articles (r = .19) and the trim and fill adjusted estimate (r = .14), we computed the difference in dollar value implied by assuming one value over the other. For these calculations, we used 40% of salary as the estimate of the standard deviation of job performance in dollars, following Hunter and Schmidt [62], and $44,888 as the average salary in the United States [63]. We thus estimated the standard deviation of job performance in dollars as .4 × $44,888 = $17,955.20. We assumed that the average test performance of those hired is the score at the 85th percentile of those completing a conscientiousness measure.
Using a common utility formula (formula 1 in [64]) and assuming 100 employees were hired who work for 20 years, the utility value for a validity of .19 is about $1,800,000 larger than the utility value for a validity of .14. Thus, the use of validity coefficients inflated by publication bias or other phenomena sharply overestimates the dollar utility of personnel selection by millions of dollars. This should be of considerable concern for organizations. We acknowledge that different assumed values yield different results. For example, in more recent cohorts of employees, few may stay in an organization or job for 20 years; thus, our estimates could be modified by considering repeated costs per hire. Yet, any reasonable values will show sharp overestimates of utility in dollars when validity estimates are inflated by publication bias. Furthermore, we note that more sophisticated utility analyses (e.g., analyses with multiple predictors) could be conducted. Our simple utility analysis is offered to show that effect size differences of around .05 are not necessarily trivial in magnitude.
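To make the arithmetic transparent, the calculation above can be sketched in a few lines of Python. The constants (100 hires, 20-year tenure, SDy = 40% of a $44,888 salary, hires averaging the 85th-percentile test score) come directly from the text; the variable names and the use of the standard library's NormalDist are our own.

```python
from statistics import NormalDist

# Assumptions from the text: 100 hires, 20-year tenure, SDy = 40% of salary,
# and hires' average test score at the 85th percentile (standard score ~1.04).
n_hired = 100
years = 20
salary = 44_888
sd_y = 0.40 * salary                   # $17,955.20
z_bar = NormalDist().inv_cdf(0.85)     # mean standardized predictor score of hires

def utility(validity):
    """Brogden-Cronbach-Gleser utility, costs ignored: N * T * r * SDy * z-bar."""
    return n_hired * years * validity * sd_y * z_bar

# Journal-based estimate (.19) vs. trim and fill adjusted estimate (.14).
gain = utility(0.19) - utility(0.14)   # on the order of the ~$1.8 million in the text
```

Different assumed values (salary, tenure, selectivity) change the dollar figure but not the qualitative conclusion that a .05 validity difference is non-trivial.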
We have drawn inferences about publication bias in part from the observation that small sample studies have, on average, relatively large effect sizes compared to large sample studies. The assertion of publication bias is also supported by the contour-enhanced funnel plot and the cumulative meta-analyses by precision (see Figs 1 and 2). Here, we consider the alternative explanation that the mean effect size difference between small and large sample studies is due to "true" differences between such studies and thus not due to publication bias. One example scenario concerning "true" differences in small vs. large studies comes from the medical literature, in which small sample studies may be drawn from a different population than larger sample studies. In medical interventions, small samples might be drawn from a population of very ill patients and thus yield larger effect sizes than larger samples, which may be drawn from a population of less ill patients. However, we have no theoretical or empirical evidence that the samples for the conscientiousness-job performance relation are drawn from such different populations.
A second scenario for "true" differences between small and large samples concerns the sensitivity of the measures. Consider a study assessing stress effects on humans. In smaller studies, it may be financially feasible to collect physiological measures of relevance to stress. However, such measures are likely more costly than self-reports of the effects of stress. In large sample studies, self-report survey data may be more common because the potentially more sensitive physiological measures are financially infeasible to collect in large samples. This may result in larger effects for the smaller studies than for the larger studies. However, in our study, the self-report measures used in smaller and larger studies are not distinguishable. Based on this reasoning, the evidence from the contour-enhanced funnel plot, and the fact that virtually all of the sample suppression is evident in data drawn from journal articles as opposed to non-journal sources, we conclude that "true" differences between the smaller and larger studies, other than sample size, are not a credible explanation. Thus, we are confident that the difference in mean effects between small and large studies is best attributed to publication bias.
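Readers who wish to probe such small-study patterns in their own data can use an Egger-type asymmetry check. The sketch below is an illustrative simulation, not our actual analysis: the suppression rule and all numbers are invented to show how selective publication of small-sample results produces a nonzero intercept in the radial (precision) regression.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate Fisher-z effect sizes around a true value, then suppress small
# nonsignificant studies -- a crude stand-in for publication bias.
true_z, studies = 0.15, 400
n = rng.integers(30, 500, size=studies)
se = 1.0 / np.sqrt(n - 3)              # sampling SE of Fisher z
z_obs = rng.normal(true_z, se)
published = (z_obs / se > 1.96) | (n > 200)   # small studies survive only if significant

# Egger-type radial regression: standardized effect on precision.
# With no bias, the intercept is near zero; suppression pushes it upward.
x, y = 1.0 / se[published], (z_obs / se)[published]
slope, intercept = np.polyfit(x, y, 1)
```

Under the simulated suppression the intercept is clearly positive, mirroring the funnel-plot asymmetry discussed above; with the suppression line removed, it hovers near zero.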
We suggest that meta-analytic researchers present a range of parameter estimates rather than a single point estimate [38,65,66]. In the context of meta-analytic reviews, triangulation means the use of multiple meta-analytic estimates, outlier identification, and publication bias detection methods to estimate the range of results rather than relying on a single point estimate [33]. According to Orlitzky [38], the use of multiple estimates permits the triangulation of results, which is important in advancing the methodological rigor in the organizational sciences and in obtaining more accurate and trustworthy results [10]. This approach is aligned with customer-centric reporting as researchers and practitioners benefit from understanding the robustness of a meta-analytic estimate [36,67]. This recommendation is also supported by the Meta-analysis Reporting Standards of the American Psychological Association [13,14,36] and other scientific organizations, such as the Cochrane Collaboration [15].
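The range-based robustness summary reported in our supporting tables (BRE, MRE, and the negligible/moderate/large judgment) can be expressed compactly. The function below is an illustrative sketch with hypothetical estimates, not our analysis code; the 20% and 40% thresholds follow the S2/S3 table definitions.

```python
def robustness(r_re, estimates):
    """Summarize the spread of meta-analytic mean estimates around the
    random-effects mean r_re (the baseline), following the S2/S3 table logic."""
    lowest, highest = min(estimates), max(estimates)
    bre = max(abs(r_re - lowest), abs(r_re - highest))  # baseline range estimate
    mre = highest - lowest                              # maximum range estimate
    rel = max(bre, mre) / abs(r_re)                     # relative to the baseline mean
    label = "negligible" if rel < 0.20 else ("moderate" if rel <= 0.40 else "large")
    return bre, mre, label

# Hypothetical example: RE mean of .19 with adjusted estimates from trim and
# fill, selection models, and PET-PEESE; a .06 range relative to .19 (~32%)
# is judged "moderate".
bre, mre, label = robustness(0.19, [0.19, 0.14, 0.13, 0.17])
```

Reporting such a range alongside the point estimate is the triangulation practice recommended above.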
If space considerations in journals prohibit detailed reporting of results, such results should be made available on journal websites as supplementary information, a practice that is common in the medical sciences [68] and in cross-disciplinary journals such as PLoS ONE. We suggest that such practices become more common in psychology and management journals. Robust and non-robust estimates are equally informative about meta-analytic results and the associated conclusions. In the former case, the findings provide assurance regarding the accuracy of the meta-analytic estimates. In the latter case, non-robust results aid in the re-evaluation and revision of previously drawn conclusions, thereby directing new research efforts.
With regard to specific methodological recommendations for the detection of outliers, we suggest the use of the one-sample removed analysis to empirically assess the influence of each individual sample on meta-analytic results [37]. This analysis provides a range of results. We also recommend outlier analyses and the reporting of results with and without outliers. We reported the results using Viechtbauer and Cheung's approach for outlier detection [48], with the diagnostic measures and criteria for determining whether a study is an outlier described by Viechtbauer [49]. In addition, we used Beal and colleagues' SAMD statistic to identify outliers [69], which yielded very similar results and essentially identical conclusions (we conducted the SAMD analyses at the sub-distribution level of analysis). However, due to potential problems with the SAMD approach when the data are heterogeneous (the approach does not take [residual] heterogeneity into account; we thank an anonymous reviewer for highlighting this issue), we did not report these results.
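A minimal version of the one-sample removed analysis can be sketched as follows, using a DerSimonian-Laird random-effects mean of Fisher-z correlations. The effect sizes and sample sizes shown are hypothetical; our actual analyses used the full data set and the diagnostics cited above.

```python
import numpy as np

def dl_mean(z, n):
    """DerSimonian-Laird random-effects mean of Fisher-z effect sizes."""
    v = 1.0 / (n - 3)                        # sampling variances of Fisher z
    w = 1.0 / v
    fe = np.sum(w * z) / np.sum(w)           # fixed-effect mean
    q = np.sum(w * (z - fe) ** 2)            # heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c)  # between-sample variance
    w_re = 1.0 / (v + tau2)
    return np.sum(w_re * z) / np.sum(w_re)

def one_sample_removed(z, n):
    """Recompute the random-effects mean with each sample removed in turn."""
    idx = np.arange(len(z))
    return [dl_mean(z[idx != i], n[idx != i]) for i in idx]

z = np.array([0.25, 0.18, 0.22, 0.05, 0.60])  # hypothetical Fisher-z values
n = np.array([120, 240, 80, 300, 45])
means = one_sample_removed(z, n)
# The spread of leave-one-out means flags influential samples: removing the
# small-n .60 value yields the lowest mean, identifying it as influential.
```

The minimum, maximum, and median of these leave-one-out means give the osr range reported in our tables.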
Regarding the publication bias assessment methods, it is important to note that funnel plot-based methods (e.g., the contour-enhanced funnel plot and trim and fill) rely on the degree of asymmetry in the funnel plot. Publication bias is one possible cause of observed distribution asymmetry. Outliers and heterogeneity, whether due to moderators or to "true" differences between small and large samples (i.e., the small sample bias; [33,50]), are other possible causes. We accounted for the possible heterogeneous effects of outliers by running all analyses with and without the sole outlier. We used the contour-enhanced funnel plot to distinguish publication bias from other causes of funnel plot asymmetry [39,50]. As noted previously, we used the moderators originally identified by Shaffer and Postlethwaite and formed sub-distributions to reduce the degree of between-sample heterogeneity [51,70], minimizing the possibility that funnel plot asymmetry resulted from this type of heterogeneity [33,70,71]. Also, our results were relatively consistent: the distributions with data from journal articles displayed noticeable publication bias while the distributions with data from non-journal sources showed negligible bias. Furthermore, all distributions involving non-contextualized measures of conscientiousness were affected by publication bias; their meta-analytic mean estimates were always non-robust. Thus, it seems unlikely that heterogeneity caused our results.
In addition, selection models, which are less affected by heterogeneity [33,41,42], provided supporting results and should receive considerable weight when estimating the effect of publication bias on meta-analytic results [33]. For virtually all distributions, the various publication bias methods yielded similar results. A key exception was the set of results from the p-uniform analyses, which did not converge well with the results of any of the other, more established methods. The p-uniform method may have been inappropriate for our data set given its degree of heterogeneity [47]. Given that the degree of heterogeneity tends to be similar in other areas of applied psychology and management, p-uniform may not be appropriate for most data sets in these research areas. Future research should investigate this issue. Similar caveats may apply to the PET-PEESE analysis [53], although its results tended to converge relatively well, especially when compared to the p-uniform results. Finally, as discussed previously, we are not aware of any empirical evidence or theoretical rationale to suggest that the small sample bias caused our results. This conclusion is also supported by the patterns of the contour-enhanced funnel plots [33,50].
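The core of PET-PEESE is a pair of weighted regressions of effect sizes on their standard errors (PET) and squared standard errors (PEESE). The sketch below illustrates the idea on simulated, bias-free data; the conditional rule for choosing between the two estimates (use PEESE when the PET intercept is significantly positive, PET otherwise) is noted in the docstring but its significance test is omitted for brevity.

```python
import numpy as np

def wls_intercept(y, x, w):
    """Weighted least-squares fit of y = b0 + b1*x; returns the intercept b0."""
    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[0]

def pet_peese_estimates(effects, se):
    """PET regresses effects on standard errors; PEESE on squared standard
    errors. The conditional estimator uses the PEESE intercept when the PET
    intercept is significantly positive, and the PET intercept otherwise."""
    w = 1.0 / se**2                      # inverse-variance weights
    return wls_intercept(effects, se, w), wls_intercept(effects, se**2, w)

# Simulated bias-free data: both intercepts should land near the true effect.
rng = np.random.default_rng(7)
n = rng.integers(40, 600, size=150)
se = 1.0 / np.sqrt(n - 3)
effects = rng.normal(0.20, se)           # true Fisher-z effect of .20, no bias
pet, peese = pet_peese_estimates(effects, se)
```

When publication bias is introduced (e.g., by suppressing small nonsignificant effects), the naive mean rises above .20 while the PET and PEESE intercepts remain closer to it, which is the adjustment these methods provide.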
Our findings align with previous warnings regarding the influence of phenomena such as publication bias and outliers on meta-analytic reviews [6,15,27,28,61,72-75]. Given our results, we argue that suggestions regarding the irrelevance of sensitivity analyses, particularly publication bias analyses [76], are clearly incorrect for the conscientiousness literature. We thus advocate comprehensive sensitivity analyses in all meta-analytic reviews to determine the degree of potential misestimation in meta-analytic results [11,13,14,36]. We note that outliers and/or publication bias may not be present in all meta-analytic reviews. For example, the sole outlier did not affect our results, and publication bias affected the data from journal articles noticeably more than data from non-journal sources. Even when outliers and/or publication bias are present, they may not substantially affect the results and conclusions of all meta-analytic distributions. However, these phenomena may have a substantial effect on some meta-analytic findings. Thus, we recommend that sensitivity assessments always be reported in journal articles or the articles' supplementary materials, regardless of whether outliers, publication bias, or related biases affect meta-analytic results.
Currently, we do not know the degree to which phenomena such as outliers and publication bias have affected our cumulative knowledge. To provide such information and more accurate meta-analytic results, we support calls for comprehensive sensitivity analyses in all meta-analytic reviews, which is aligned with recommendations from the Meta-analysis Reporting Standards of the American Psychological Association [13,14] and previous research efforts [3,11,33,36,75].
Researchers in applied psychology and the organizational sciences typically know the extent of measurement error in their data and sometimes have information on range variation in variables of interest. With such information, researchers can use psychometric meta-analysis methods [35] to obtain mean estimates of effects that would be obtained in the absence of measurement error and range restriction (or range enhancement). Such estimates are useful for studying relations among variables at the construct or latent level, contribute to theory clarification, and are valuable in practical applications (e.g., comparisons of the value of various employment screening procedures). Unfortunately, current publication bias methods have not been designed with psychometrically adjusted effect sizes in mind. Nor do current publication bias methods accommodate psychometric meta-analytic perspectives on study weighting (sample size vs. inverse variance weighting), effect size transformations (i.e., Fisher z), and sampling error estimation (i.e., the estimate of rho in sampling error estimates). Psychometric meta-analysis methods that correct individual effect sizes for measurement error and range variation yield effect sizes that could be used in current publication bias methods if the standard errors are appropriately adjusted (and the methods do not estimate sampling error from sample size and the observed effect size). Still, there is no current research assessing the accuracy of publication bias methods under the psychometric approach. Also, to our knowledge, there are no publication bias methods that are adaptable to psychometric meta-analysis approaches using artifact distributions. We encourage efforts to evaluate current methods of publication bias detection for psychometric meta-analysis applications.

Conclusions
Sensitivity analyses are rarely performed in the organizational sciences [12,29,72]. Despite suggestions to the contrary [76], we found that publication bias can have noticeable effects on meta-analytic results. Our findings illustrate the need for a rigorous quantitative assessment of the robustness of meta-analytic results. Errors in primary studies are problematic, but errors affecting the conclusions of a meta-analytic review can mislead future research directions and misinform evidence-based practice [3,4,33]. We encourage the use of methods for the detection of outliers and publication bias in all meta-analytic reviews and, aligned with the approaches of triangulation [38,65,66] and customer-centric science [67], the reporting of the range of results. Journals, which should provide the best estimates of effect sizes, are, ironically, providing the most biased estimates. Clearly, journal policies, and the behavior of authors responding to those policies, are in need of substantial revision [10].
Supporting Information

S1 Table. Meta-analytic and publication bias results (outlier excluded). k = number of correlation coefficients in the analyzed distribution. Publication bias analyses were not conducted for distributions with fewer than k = 10 samples; r o RE = random-effects weighted mean observed correlation; 95% CI = 95% confidence interval; 90% PI = 90% prediction interval; Q = weighted sum of squared deviations from the mean; I 2 = ratio of true heterogeneity to total variation; τ = between-sample standard deviation; osr = one-sample removed, including the minimum and maximum effect size and the median weighted mean observed correlation; Trim and fill = trim and fill analysis; FPS = funnel plot side (i.e., side of the funnel plot where samples were imputed; L = left, R = right); ik = number of trim and fill imputed samples; t&f r o = trim and fill adjusted observed mean (the weighted mean of the distribution of the combined observed and imputed samples); t&f 95% CI = trim and fill adjusted 95% confidence interval; sm m r o = one-tailed moderate selection model's adjusted observed mean (and its variance); sm s r o = one-tailed severe selection model's adjusted observed mean (and its variance); Ex. sig. = excess significance; PET-PEESE = precision-effect test-precision effect estimate with standard error; PET = PET adjusted observed mean (and its one-tailed p-value; the value from PEESE is the adjusted observed mean if the PET value is significant, the value from PET is the adjusted observed mean if the p-value is not significant [45]); PEESE = PEESE adjusted observed mean; P-TES = the probability of the chi-square test of excess significance; p-uniform (95% CI) = the p-uniform estimate and its 95% confidence interval; n/a = not applicable (because k was too small to conduct these analyses or because the variance component for the selection models indicated that the estimate was nonsensical [33]). (DOCX)

S2 Table. Robustness of results and conclusions of the analyses (outlier excluded). Lowest value = lowest mean estimate from all analyses (r o RE, osr, r o FE, t&f r o, sm m r o, sm s r o, and PET-PEESE; we did not include the p-uniform values due to their lack of convergence with the results of the other, more established methods, likely caused by the poor performance of this method with heterogeneous data [van Assen et al., in press]); r o RE = random-effects weighted mean observed correlation (the potentially best mean estimate); Highest value = highest mean estimate from all analyses (r o RE, osr, r o FE, t&f r o, sm m r o, sm s r o, PET-PEESE); BRE = baseline range estimate: the absolute range between r o RE and the estimate farthest away (either the lowest or the highest value); MRE = maximum range estimate: the absolute range between the lowest and highest values. When calculating the relative difference of the range estimates, we used r o RE, the potentially best mean estimate, as the base (i.e., as 100%). Ideally, BRE and MRE should be identical; if not, outliers or other artifacts may have caused such differences. Practical difference: negligible = the relative range (BRE or MRE) is smaller than 20%; moderate = the relative range is larger than 20%; large = the relative range is larger than 40% [33]. (DOCX)

S3 Table. Robustness of results and conclusions of the analyses (including p-uniform estimates). Lowest value = lowest mean estimate from all analyses (r o RE, osr, r o FE, t&f r o, sm m r o, sm s r o, PET-PEESE, and p-uniform); r o RE = random-effects weighted mean observed correlation (the potentially best mean estimate); Highest value = highest mean estimate from all analyses (r o RE, osr, r o FE, t&f r o, sm m r o, sm s r o, PET-PEESE, and p-uniform); BRE = baseline range estimate: the absolute range between r o RE and the estimate farthest away (either the lowest or the highest value); MRE = maximum range estimate: the absolute range between the lowest and highest values. When calculating the relative difference of the range estimates, we used r o RE, the potentially best mean estimate, as the base (i.e., as 100%). Ideally, BRE and MRE should be identical; if not, outliers or other artifacts may have caused such differences. Practical difference: negligible = the relative range (BRE or MRE) is smaller than 20%; moderate = the relative range is larger than 20%; large = the relative range is larger than 40%. (DOCX)

S4 Table. Moderator statistical tests using the between group Q test (outlier excluded). (DOCX)