Situational Judgment Tests as a method for measuring personality: Development and validity evidence for a test of Dependability

Gabriel Olaru; Jeremy Burrus; Carolyn MacCann; Franklin M. Zaromb; Oliver Wilhelm; Richard D. Roberts

doi:10.1371/journal.pone.0211884

Abstract

Situational Judgment Tests (SJTs) are criterion valid low fidelity measures that have gained much popularity as predictors of job performance. A broad variety of SJTs have been studied, but SJTs measuring personality are still rare. Personality traits such as Conscientiousness are valid predictors of many educational, work and life-related outcomes and SJTs are less prone to faking than classical self-report measurements. We developed an SJT measure of Dependability, a core facet of Conscientiousness, by gathering critical incidents in semi-structured interviews using the construct definition of Dependability as a prompt. We examined the psychometric properties of the newly developed SJTs across two studies (N = 546 general population; N = 440 sales professionals). The internal validity of the SJTs was examined by correlating the SJT scores with related self-report measures of Dependability and Conscientiousness, as well as testing the unidimensionality of the measure with CFA. Additionally, we specified a bi-factor model of SJT, self-report and behavioral checklist measures of Dependability accounting for common and specific measurement variance. External validity was examined by correlating the SJT scale and specific factor with work-related outcomes. The results show that the Dependability SJTs with an expert based scoring procedure were psychometrically sound and correlated moderately to highly with traditional self-report measures of Dependability and Conscientiousness. However, a large proportion of SJT variance cannot be accounted for by personality alone. This supports the notion that SJTs measure general domain knowledge about the effectiveness of personality-related behaviors. We conclude that SJT measures of personality can be a promising addition to classical self-report assessments and can be used in a wide variety of applications beyond measurement and selection, for instance as formative assessments of personality.

Citation: Olaru G, Burrus J, MacCann C, Zaromb FM, Wilhelm O, Roberts RD (2019) Situational Judgment Tests as a method for measuring personality: Development and validity evidence for a test of Dependability. PLoS ONE 14(2): e0211884. https://doi.org/10.1371/journal.pone.0211884

Editor: Timo Gnambs, Leibniz Institute for Educational Trajectories, GERMANY

Received: August 17, 2018; Accepted: January 23, 2019; Published: February 27, 2019

Copyright: © 2019 Olaru et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All data files are availiable at: https://osf.io/uacb6/.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Situational Judgment Tests (SJTs) are low fidelity simulations that in recent decades have been widely adopted in the workforce for personnel selection [1]. SJTs typically present a situation describing a dilemma or problem along with different response options which test-takers evaluate using their knowledge, skills, abilities, and/or other characteristics (2). Indeed, numerous studies have demonstrated SJTs to be efficient–that is cheap and easy to create, administer and evaluate–and criterion-valid predictors of many work-related outcomes, such as job performance, interpersonal skills, or leadership (M_ρ = .20-.30) [2,3]. As a result, it has become very common in the workforce for employers to incorporate SJTs as one of their tools for personnel selection [1].

Although SJTs are already established as criterion-valid predictors of work-related outcomes [3–5], there is little consensus on what SJTs actually measure [6]. In addition to the original interpretation of SJTs as measures of tacit or job knowledge [7,8], SJTs have also been understood as predictive methods without a clear internal structure [9], as measures of situation-specific reactions [10] (but also see [11]), or as measures of dimensions, such as personality [2]. Jackson and colleagues [6] evaluated these perspectives by using variance decomposition [12] to identify relevant aspects captured with SJTs. Their results suggest that situations explain little variance in the SJT responses (i.e., around 1–3%) [12], as do domains (i.e., 0–6%). Instead, they found that the majority of SJT variance can be attributed to ability differences between respondents (i.e., 48–67%), which might be in line with the original definition of SJTs as measures of knowledge. However, the SJTs evaluated by Jackson and colleagues [6] were used as selection tools for job applicants, and were thus developed primarily with the intent of maximizing predictive validity. Christian and colleagues [2] suggest that SJTs can, and should, be developed with the goal of measuring specific constructs, which would arguably increase the trait variance captured by this assessment method. Newer studies that follow this approach have shown the potential of SJT measures of personality [13,14]. In this article, contribute to the ongoing discussion by developing SJT measures of personality (i.e., Dependability) and examine the construct validity of the newly developed measures.

SJT versus traditional self-report measures of personality

A reasonable question to ask at this point is how personality SJTs can contribute to research and practice, compared to self-report measures of personality or traditional SJTs. Personality traits, such as Conscientiousness, Emotional Stability, and Agreeableness, are well established predictors of many relevant life outcomes (e.g., life satisfaction, longevity) [15,16], as well as academic [17] and work-related performance [18]. For example, in education, a meta-analysis on the relations between cognitive ability and personality with academic outcomes has shown that in secondary and tertiary education, Conscientiousness is as important for academic performance as cognitive ability [17]. In the workplace, conscientiousness predicts important outcomes like job performance and job satisfaction [18–21]. Other personality factors such as Agreeableness and Neuroticism, can predict counterproductive work behavior and performance in teams [22]. As such, a single SJT measure of personality can be used to predict many different relevant outcomes, thus saving time and resources compared to developing specific SJT batteries for different outcomes. In addition, the rank-order stability of personality is high compared to, for instance, job knowledge [23], and as such, personality SJTs may also be better suited to predict future behavior. Developing a comprehensive SJT measure of personality thus enables researchers and practitioners to subsequently match relevant traits to outcomes and achieve reasonable predictive validity with a relatively small amount of work [2].

There are also several potential advantages of using SJTs to measure personality constructs as compared to using traditional self-report measures. First, SJTs are demonstrably less prone to faking than traditional self-report measures [24–26]. SJT scores showed much smaller mean level differences between faking and regular instruction conditions than self-report measures. The extent to which participants were able to increase their SJT scores seemed only to be related to cognitive ability, whereas faking in a self-report context is influenced by a magnitude of factors, for instance other personality traits [26]. SJTs also display less adverse impact than self-report Likert type scales for subgroups such as gender and ethnicity [5,27,28]. In addition, SJTs can also reflect subtler judgment processes by relating specific behaviors to situations, and may thus enhance the measurement of personality constructs. In a training context, SJTs can also be easily applied as formative assessments by elaborating the purposefulness or consequences of each response option in the respective context.

Nonetheless, we also want to point out that SJT measures of personality are not yet well established. While Mussel and colleagues [13] developed SJT measures of the NEO-PI-R facets [29] that correlate considerably with the original NEO-PI-R scales [30], ranging from a correlation of .41 for the Agreeableness facet Compliance to .70 for the Openness facet Openness for Ideas, Lievens and Motowidlo [31] suggested that the correlation between SJTs and personality can be attributed to a related, but distinct construct, namely the knowledge about the usefulness of having high or low levels of a given personality trait. This type of knowledge, referred to as implicit trait policies [32], represents the knowledge about the effectiveness of specific personality-related behaviors in the situations presented by SJTs. The theory of implicit trait policies argues that people with high levels on a trait also know about the utility of the trait related behaviors in specific situations. As such, these people will also be more likely to endorse these behaviors in SJT-type assessments. The small to moderate correlations found between many SJTs and personality traits [2,33,34] can thus be attributed to this implicit knowledge about the effectiveness of the traits and related behaviors. While we apply a construct-based approach in this study to develop SJT measures of personality, low correlations between the SJTs and classical personality measures may be indicative that the SJTs measure implicit trait policies instead.

Facets versus broad domains of personality

Broad trait domains such as personality factors should be seen as overarching second-order factors on top of more specific first-order factors–often labeled facets [29,35]. For example, the Big Five Factor Conscientiousness can encompass facets such as Dependability, Dutifulness, or Discipline. Measuring the specific underlying facets can be even more advantageous, for several reasons. First, as the content area of a facet (e.g., Dependability) is more specific than a domain (e.g., Conscientiousness), measurements of facets can capture elements of personality with a higher fidelity than scales based on the broad domains alone [36]. This also makes tests of personality facets easier to develop, as construct definitions are more specific than for broad domains. In addition, the more specific facet measures have shown to have higher test criterion evidence than the broad trait measures. Facet measures can show stronger relations to outcomes than general trait domains by capturing relevant aspects more precisely [9].

Dependability is a core facet of Conscientiousness and one of the best predictors of overall job performance of the Conscientiousness facets [19]. A person with high Dependability is reliable, responsible, fulfills obligations and respects authority. Dependability has been rated as the most valued work style or attribute by employers in the evaluation of the United States Department of Labor’s Occupational Information Network [21]. Dependability is ranked in the top 3 valued traits for 19 out of 23 job families covering approximately 1,102 occupations. These data provide support for the potential value of developing a Dependability SJT measure.

Current investigation

The main goal of this investigation is to further examine the validity of newly developed SJT measures of personality constructs in two studies. This was achieved by developing innovative SJTs following recommended best practices in SJT construction and conducting psychometric studies designed to evaluate the reliability and validity of these measures. We will examine whether the new construct-based personality SJTs are reliable and valid measures of the personality construct Dependability. We will also examine the impact of different scoring procedures on the psychometric quality of these types of SJTs. After construct validity has been established, we will examine the criterion-related validity of the new type of construct-based SJT as compared to typical self-report measures of personality.

Study 1

The main aim of Study 1 was to examine the psychometric quality of newly-developed construct-based SJTs. SJTs were developed to measure Dependability, a core facet of Conscientiousness. Two scoring procedures were compared, one based on expert ratings and one based on the sample distribution (i.e., consensus scoring). We examined the impact of the scoring procedure on construct validity evidence by relating SJT scores to other personality assessments, such as the Big Five Inventory [37], and on structural validity evidence through a one-factor confirmatory factor analysis (CFA) of the 18 SJT items (as we expected all 18 SJTs to measure a common Dependability factor). We then compared the Dependability SJT scores with scores derived from alternative measurement methods of Dependability (a self-report rating scale and a self-report biographical data questionnaire). To further examine whether SJTs capture individual differences in personality, we specified a multi-method CFA model accounting for common trait and specific assessment method variance across the three measures of Dependability. Under the assumption that the SJTs do indeed measure personality instead of implicit trait policies, we predicted the following results:

The SJTs will yield acceptable model fit and reliability for the one factor model encompassing all 18 SJTs.
The SJTs will correlate moderately with the Dependability self-report and biographical data questionnaires.
The SJTs will correlate moderately the BFI measure of Conscientiousness.
The SJTs will not correlate with the other Big Five factor scores.

Method

The study conforms to Standard 9 of the American Psychological Association’s Ethical Principles of Psychologist and Code of Conduct. The sample consists of adults that participated voluntarily in this study. Consent was informed. At the start of the study, participants were informed that they could abort the survey at any time and still receive full compensation. By beginning the study, consent was given. No personal identifiers (e.g., Social Security Number) were collected.

Participants.

Participants were 600 Amazon Mechanical Turk (AMT) workers who were residents of the United States. AMT has the benefit of providing fast recruitment of samples that are demographically more diverse than typical college or internet samples [38,39]. The quality of the data collected in AMT is reported to be at least as reliable as other data collection methods [38–40]. The majority of AMT workers also seem to participate for intrinsic reasons (e.g., entertainment) and may be more motivated to complete the tasks given. From our initial sample of 600 participants, we excluded 54 people (9%) who either did not complete the study or failed to provide correct answers to at least 3 out of 5 instructed-response questions designed to identify random or other forms of inattentive responding [41]. The mean age of the remaining 546 cases was 34.5 years (SD = 10.2). In this sample 293 participants were female. Half of the sample held at least a bachelor’s degree. Participants were given $4 for their participation in the 30-minute survey, which is much higher than the median AMT compensation rate of $1.38 per hour [42].

Measures.

Dependability SJTs. Semi-structured interviews were held with five individuals in full-time work (three males and two females), all but one of whom had obtained a university degree. The researcher took notes as the interviews progressed. The standard question prompt was varied to include content phrases indicating high and low levels of dependability: “Tell me about a time when you or a colleague of yours has <insert term from construct definition below>. What was the situation? What happened?” High dependability phrases included: been reliable, been responsible, been dependable, been industrious/hard-working; been efficient; been punctual; been consistent; shown a strong work ethic; been well-prepared; made and stuck to their plans. Low dependability phrases included: been unreliable, been lazy, been frivolous, wasted time; shirked their duties; not followed through on plans, left things unfinished. Follow-up questions asked for clarification of the behaviors, with the standard prompt “what did they do?” and requests for further detail regarding the context of the behavior if this was unclear. The high versus low descriptors were drawn from the O*Net descriptions of Dependability [43], and edited for clarity and ease of understanding. Based on these situation descriptors, three to five sentence descriptions of situations were created, along with five possible responses that intentionally varied from low to high dependability.

The situations were not contextualized to any specific profession, but reflected general work situations instead, such that the instrument would be relevant to a broad range of occupations, as well as work-readiness assessments for people entering the job market for the first time. As such, these situations have little reliance on occupational knowledge.

The behavioral instruction for the SJTs read, “How likely are you to respond with each of following actions?” Respondents answered to each response option on a 5-point Likert scale ranging from “Very Unlikely” to “Very Likely” An example of the resulting SJTs is presented below:

“You are asked to deliver a critical report to your supervisor by close of business today. At your company, reports such as this one are supposed to be prepared according to specific procedures and guidelines. If you follow all the steps in the order suggested, however, you will not meet the deadline.”

How likely are you to respond with each of following actions?

Keep working on the report, following all procedures and guidelines, and give your supervisor whatever you have completed by the end of the day.
Follow the procedures and guidelines and work into the night so you can deliver the completed report by start of business tomorrow.
Tell your supervisor that you cannot complete the report by close of business today.
Ignore the procedures and guidelines and do only the most essential parts of the report to meet the deadline.
Ignore the procedures and guidelines, but take as much time as you need to do the job.

We included a number of additional personality measures to examine the validity of our SJTs. In addition to including a well-established measure of the Big Five, we developed self-report and biographical data measures of Dependability to examine the construct validity of the SJTs with different assessment methods of the same construct in a multi-method design.

Big Five Inventory. The Big Five Inventory (BFI) [37] is a 44 item measurement of the Big Five trait domains. Each item (e.g. “I see myself as someone who does a thorough job”) is measured on a five-point Likert scale ranging from “Strongly Disagree” to “Strongly Agree”.

Dependability self-reports. We developed 30 self-report items measuring dependability (e.g., “I start tasks right away”, “I leave things unfinished”) based on the O*Net descriptions of Dependability [43]. The items were developed to capture all aspects listed in the definition of Dependability, thus providing a broad construct coverage. Half of the items were reverse coded. Each item was measured by a six-point Likert scale ranging from “Strongly Disagree” to “Strongly Agree”.

Dependability biographical data measure. We additionally developed 18 biographical data (checklist) items assessing past behavior (e.g., “Taken more than one day to return a phone call”, “Given someone useful advice”) with the instruction “To which extent have you engaged in each of the following behaviors in the last year?” Again, we tried to select behaviors that allowed us to capture all aspects of the Dependability definition. Each biodata item was answered on a six-point Likert scale ranging from “Never” to “Always”.

SJT scoring procedures.

Expert scoring. We asked four subject matter experts from industrial-organizational and personality psychology to rate each response option on the extent to which it was representative of Dependability, on a five-point Likert scale from “very undependable” to “very dependable”. Across all 89 response options (one was excluded for being a data-check item) the overall mean of the expert ratings was 2.99 (SD = 1.31; on a scale from 1 to 5), which suggests that the desirability of responses was evenly balanced across all SJTs. The Intra-Class Correlation between the four raters was .66.

To account for varying response styles (e.g., some people using the extreme ends of the scales, some using only one end of the scale), we intra-individually z-standardized raw scores across all SJT responses (i.e., a person’s ratings were converted to z-scores, so that each person had a mean of 0 and a standard deviation of 1 across all responses). The expert rating profile was also z-standardized. We then computed the absolute difference between the respondents’ and expert standardized scores on every response option. Scores were added up for every SJT. As higher scores reflect a higher deviation from the expert profile, scores were subsequently reversed by subtracting them from 0.

Consensus scoring. We computed the sample proportions in each response option and weighted the respondents’ selection based on these proportions. For example, if 32% of the sample chose “very likely to do” on a response option, this option will be scored with 0.32. Scores across response options were added up for every SJT. A simplified example both SJT scoring procedures can be found under https://osf.io/uacb6/.

Results

Dependability self-report and biodata scales.

We evaluated each of the newly developed scales by testing the model structure with CFA. We specified one-factor models for each scale and estimated the models using the MLR estimation in Mplus 7 [44]. The 30-item self-report dependability yielded insufficient model fit (χ² = 1,868; df = 405; CFI = .79; RMSEA = .08; SRMR = .06) [45]. However, the source of model misfit was unclear, as all items yielded sufficient loadings. One possibility might be the large number of indicators, which is often a problem for self-report scales [46]. We thus used the item selection algorithm Ant Colony Optimization [47,48] to identify the 18 items that would optimize the CFI and RMSEA value of the model. The resulting 18-item model fitted the data well (χ² = 322; df = 135; CFI = .94; RMSEA = .05; SRMR = .04) and yielded good factor saturation (McDonald’s ω = .93). The one-factor 18-item biodata model yielded bad model fit (χ² = 667; df = 135; CFI = .70; RMSEA = .09; SRMR = .09). Five items yielded factor loading close to zero, suggesting that these items do not measure Dependability. After removing these items, the 13-item model yielded acceptable model fit (χ² = 185; df = 65; CFI = .90; RMSEA = .06; SRMR = .05) and factor saturation (ω = .85). We thus used the shortened scales for the subsequent analysis. Factor loadings for the models can be found in the online repository under https://osf.io/uacb6/.

SJT scoring.

The Expert based and Consensus SJTs scores correlated around r = .80 (p < .01). However, model fit of the unidimensional CFA models differed strongly between the scores. We estimated one-factor models for both scoring procedures with MLR estimation. The Expert scores resulted in good model fit (χ² = 189; df = 135; CFI = .95; RMSEA = .03; SRMR = .04, ω = .78), whereas the Consensus scores showed poor fit to the data (χ² = 666; df = 135; CFI = .67; RMSEA = .09; SRMR = .08, ω = .80).

Correlation with personality scales.

Table 1 shows the correlations between the SJT scores and personality self-report measures. Consensus-based SJT scores yielded only small correlations with the self-report and biographical data measures of Dependability. The correlations with Conscientiousness as measured by the BFI was not significant. The Expert score showed moderate correlations with the other measures of Dependability (self-report: r = .47; p < .01; biodata: r = .29; p < .01) and the Conscientiousness measure (r = .33; p < .01). As expected, correlations with the other measures of Dependability are higher than correlations with the broad Conscientiousness factor measured by the BFI. While the Expert-scored SJTs correlate highest with the Conscientiousness factor in the BFI, the correlation with Agreeableness (r = .30; p < .01) is also substantial and close in magnitude to the correlation with Conscientiousness. This finding can be attributed to the social context of the SJTs, in which agreeable behaviors (e.g., helping others) are also indicative of Dependability. Note that correlations between self-report measures of Agreeableness and Conscientiousness (r = .42; p < .01) or Dependability (r = .45; p < .01) are also very high in this sample and might indicate social desirability effects.

Download:

Table 1. Correlations of the dependability scales with self-report measures of personality.

https://doi.org/10.1371/journal.pone.0211884.t001

Multi-method model.

To examine the unique proportion of variance in the SJTs compared to the other measures of Dependability, we estimated a bi-factor model on all three Dependability measures with a general Dependability factor and uncorrelated specific nested factors for SJTs, self-report and biodata measures (see Fig 1). The nested factors are intended to capture the unique method variance of each test format. However, note that the nested factors might also include differences in the construct coverage (we tried to minimize this by developing all three measures based on the O*Net definition of Dependability).

Download:

Fig 1. Multi-method bi-factor model of dependability.

SR = self-report; BD = biodata. The loadings presented represent the standardized loading range of the corresponding scales. Negative loadings on the SR and BD factors result from response effects (e.g., acquiescence) on negatively coded items. Model fit: CFI = .90; RMSEA = .04; SRMR = .05.

https://doi.org/10.1371/journal.pone.0211884.g001

Goodness-of-Fit indices of the model with MLR estimation were acceptable (χ² = 1,802; df = 1,078; CFI = .90; RMSEA = .04; SRMR = .05). The self-report items yield the highest loadings on the general Dependability factor (average λ = .70; see https://osf.io/uacb6/ for full loading structure) as well as lowest specific factor loadings (average λ = .19). In contrast, the loadings of the SJT items were stronger on the specific factor (average λ = .34) than on the general factor (average λ = .21), suggesting that a large portion of the SJT variance captures unrelated individual differences. Biodata items loaded slightly higher on the general factor (average λ = .36) than on the specific factor (average λ = .26). Table 2 shows the correlation between the four factors and the BFI scores. The overall Dependability factor correlated very highly (r = .82; p < .01) with BFI-C, supporting the notion that the three scales measure a central aspect of the trait. Correlations between BFI-C and the SJT and biodata factors were close to zero. The somewhat larger relationship between the self-report nested factor and BFI-C can be attributed to the method-effect of self-report items (correlation with the self-report factor: r = .44; p < .01), which are not present when using the SJT method (correlation with the SJT factor: r = -.06; p > .05). Correlations of the Dependability factor with the BFI-A scores were moderate (r = .41; p < .01), showing that the correlation between the Dependability scales and Agreeableness is mostly driven by similarities between the constructs or potential social desirability effects. The social aspect of the SJT situations does not seem to contribute to the zero-order correlation between SJTs and BFI-A shown in Table 1.

Download:

Table 2. Bi-factor model correlations with the bfi personality scores.

https://doi.org/10.1371/journal.pone.0211884.t002

Discussion

The CFA findings support the unidimensionality of the 18 SJT scores. The SJTs in this study were moderately related to self-report and behavioral frequency checklist measures of Dependability and Conscientiousness. While correlations with the other Dependability measures were similar to findings by Mussel and colleagues [13], the relatively low correlation with Conscientiousness and the low Dependability factor loadings in the multi-method model suggest that only a small to moderate proportion of the SJT variance is related to personality. There are several potential explanations for this effect: One explanation for this finding could be that the SJTs capture implicit trait policies [32] instead of the personality traits directly. The correlation between the SJTs and self-report measures of Dependability or Conscientiousness is also arguably reduced due to the scoring procedure applied. As we intra-individually z-standardized SJT responses and compared them to the expert profile, scale usage effects (e.g., acquiescence) are eliminated, whereas these might have artificially increased the correlation between the self-report scales. In addition, SJTs are also less prone to faking and social desirability effects compared to the traditional measures of personality. This might have further reduced the correlation between the different assessment methods. These explanations are also supported by the relatively high correlations between the different BFI scales. Surprisingly, the SJT correlations with self-reported Agreeableness were nearly as high as the correlation with Conscientiousness. However, as the multi-method model showed, this correlation can be attributed to the relation between Dependability and Agreeableness instead of specific SJT variance. The construct definition of Dependability also encompasses fulfilling obligations and respecting authority, which seem to be related to the Agreeableness facets Cooperation and Compliance. In comparison, the self-report scales of Conscientiousness correlated more highly with Agreeableness than the SJT scale (.42-.43 vs. .30), also suggesting a reduced impact of scale usage and social desirability in the SJTs.

The Consensus scoring procedure performed substantially worse than the Expert scored SJTs. Model fit was insufficient for the Consensus-based scores, and correlations with other measures of Dependability and Conscientiousness were substantially lower. Consensus scoring may be problematic in this context for a number of reasons. In a maximal performance setting, the scoring procedure is problematic for SJTs with higher difficulty, as they may not be correctly solved by a large proportion of the sample. The difficulty of SJTs can be artificially reduced or distorted, as responses are scored based on their perceived effectiveness by a sample with usually less insight than experts. When measuring typical behavior, this scoring procedure will result in more heterogeneous scores, as the responses do not converge towards an “optimal” or “correct” response. In addition, the Consensus scoring procedure will assign the highest score to participants that respond similarly to the rest of the sample, thus arguably favoring responses in the middle of the scale. In contrast, the Expert scoring is independent of scale usage effects because of the z-standardization and transforms the raw SJT responses into a difference metric based on a common expert profile. The resulting scores are thus much more homogenous than the Consensus scores.

Study 2

The goal of the second study is to replicate the findings from Study 1 and gather additional validity evidence for the newly developed SJTs by examining the criterion-related validity in a sample working in sales. Work-related outcomes were measured by assessing job performance–task performance (the percentage of sales objective and income goal reached last year) and counterproductive workplace behavior [49]–as well as variables that indicate workplace wellbeing (job satisfaction and turnover intentions).

In addition to examining construct validity in the same manner as in Study 1, we will examine whether the Dependability SJTs are capable of predicting work-related outcomes. Based on previous findings on the relationship between Conscientiousness and general job performance [18,19] or sales performance [18,50] we expect the Dependability SJTs as a measure of a core facet of Conscientiousness to correlate positively with measures of job performance. We also expect the SJTs to be positively related to work satisfaction [51] and negatively to counterproductive workplace behavior and turnover intentions [16,51–53]. In addition, we expect the SJTs to provide incremental validity in predicting performance beyond classical self-report measurements of personality [54,55]. In addition to the construct validity hypotheses proposed in the previous study we predict the following:

V. The SJT method will predict task performance measures incrementally beyond other measures of Dependability.
VI. The SJTs will predict counterproductive workplace behavior incrementally beyond other measures of Dependability.
VII. The SJTs will predict job satisfaction and turnover intentions incrementally beyond self-report measures of Dependability.