Personality traits and academic performance: Correcting self-assessed traits with vignettes

In this study, we investigate whether Conscientiousness, Emotional Stability and Risk Preference relate to student performance in higher education. We employ anchoring vignettes to correct for heterogeneous scale use in these non-cognitive skills. Our data are gathered among first-year students at a Dutch university. The results show that Conscientiousness is positively related to student performance, but the estimates are strongly biased upward if we use the uncorrected variables. We do not find significant relationships for Emotional Stability but find that the point estimates are larger when using the uncorrected variables. Measured Risk Preference is negatively related to student performance, yet this is fully explained by heterogeneous scale use. These results indicate the importance of using more objective measurements of personality traits.


Introduction
Personality is an important predictor of life outcomes (see, e.g., [1,2]). Personality is typically measured by asking individuals to evaluate self-reflective statements on subjective scales. One important issue with this approach is that people may systematically differ in scale use. If differences in scale use are correlated with outcomes, the relationship between measures of personality traits and such outcomes can be biased.
In this paper, we study whether people differ systematically in the values they attach to personality item scales and whether these systematic differences bias the relationship between measured personality traits and academic performance. We study biases with respect to Emotional Stability, Conscientiousness, and Risk Preference. The first two are personality traits from the Big Five Inventory. Risk Preference is an economic preference parameter and not a personality trait. However, for the sake of brevity, we refer in this paper to Conscientiousness, Emotional Stability, and Risk Preference as "personality." Scale use biases the relationship between self-reported personality and academic performance if it is related both to self-reported personality and to academic performance. The relationship between scale use and self-reported personality is obvious. The relationship between scale use and academic performance can occur, for instance, if students who score higher on achievement tests have different comparison groups in mind when completing the survey on personality traits than students who score lower on achievement tests.
To separate true differences in personality from differences in scale use, we employ anchoring vignettes (see [3,4]). We first ask students about their own Conscientiousness, Emotional Stability, and Risk Preference. Then, we ask the same questions for a fictive hairdresser, surgeon, firefighter, and bus driver. We use the answers on the latter questions to identify heterogeneity in scale use. In a final step, we analyze whether correlations with academic performance differ for the uncorrected and corrected answers on personality traits. The main result of our analysis is that the relationship between uncorrected personality and academic outcomes is overestimated if differences in scale use are not taken into account. This conclusion holds for all traits we investigate.
There is a vast literature on the relationship between personality traits and academic performance. For an overview, see, e.g., [1,5-7]. Fig 3 in [1] (p. 1007) summarizes some of the main conclusions from this literature; of the Big Five traits, Conscientiousness is most strongly related to college grades. There is no strong correlation between Emotional Stability and grades. There is no consensus in the literature concerning the relationship between Risk Preference and academic performance (see the overviews in [1,8]).
Our paper contributes to the literature which studies whether answers on subjective scales differ from objectively measured indicators. For instance, [9] investigate why workers in different Western countries report very different rates of work disability. They show that Dutch respondents have a lower threshold in reporting whether they have a work disability than American respondents. [10] investigate the difference between stated physical activity and physical activity measured by accelerometers across different countries. The self-reported data show minor differences across countries, while the accelerometer data show that the Dutch and English are much more physically active than Americans. Regarding life satisfaction, [11] show that Americans are more likely to use the extremes of the scale than the Dutch, who are more inclined to stay in the middle of the scale.
As proposed in [12], an important challenge in personality psychology is finding objective measures or vignettes for personality traits to correct for bias due to scale use. Some previous research has applied anchoring vignettes to measure personality traits more accurately. [13] show that self-reports of non-cognitive skills are sensitive to survey administration conditions. Providing information about the importance of non-cognitive skills to students, for instance, directly affects their responses. [14] use anchoring vignettes to compare measures of Conscientiousness across 21 countries and show that country rankings of self-reported Conscientiousness to some degree result from differences in response styles. [15] show that the reliability of scales assessing Conscientiousness and Openness to Experience increases when using anchoring vignettes in a study of 12th-grade students in Brazil. [16] employ anchoring vignettes for the Big Five Inventory in Rwanda and the Philippines. They show that adjusted scores have better measurement properties relative to scores based on the original Likert scale. In their study, correlations of the Big Five personality factors with life satisfaction were essentially unchanged after the vignette-adjustment while correlations with counterproductive behavior were noticeably lower. We contribute to this literature by focusing on the relationship between personality traits and academic performance.
Our findings have important policy implications. The predictive power of personality and preferences for academic achievement found in earlier research may serve as a tool for policy makers. Personality and preferences can serve as signals for future performance and, therefore, policy makers can help children with adverse personality traits or preferences, e.g. by providing more support to such children. Virtually all papers studying the relationship between personality and achievement have used subjective measures of personality. Our paper indicates that the predictive power of subjective measures of personality may be overestimated. The correlations between personality and academic performance will be more useful for policy makers if more objective measures of personality traits are employed.
The set-up of this paper is as follows. Section 2 discusses the data. Section 3 shows the empirical strategy. Section 4 reports the results. Section 5 concludes.

Ethics statement
Ethics approval was obtained from the Ethical Review Committee Inner City faculties of Maastricht University (ERCIC_044_14_07). All participants in the research provided written consent.

Sample
We collected data among students at a Dutch Economics and Business school in the first course of the first year. In the Netherlands, curricula of Economics and Business programs typically share several introductory courses, giving rise to classes of 1000 students or more. The data collection consisted of two parts. The first was held as a mandatory assignment in an introductory course in quantitative methods. This implied that all active students participated in the survey. In this first part, questions about Risk Preference were included. As a result, we have information on the Risk Preference of 1056 students. We also have information on the grade obtained in this course for all these students. Risk Preference was measured twice: in week 1 and week 6 of the seven-week course. Because we measure the Risk Preference vignettes in week 6, we also use the data on Risk Preference from week 6. Using the data on Risk Preference from week 1, however, yields very similar results. This suggests that Risk Preference remains relatively stable during this period (the correlation between the two measures is 0.555, p<0.0001) and, therefore, that reverse causality does not appear to play a role. In the analyses on Risk Preference, we furthermore restrict the sample to those who also answered the questions on Conscientiousness and Emotional Stability in order to show estimations on the same sample for all traits. If we do not restrict the sample, we again find very similar results.
A second survey was held one week after the course ended. Participation in this survey was voluntary. We gave every fifth participant 10 euros and held a lottery among all participants with a 1000 euros prize. In total, 625 students participated. For these students, we have information on their Conscientiousness and Emotional Stability. Table A in S1 File presents summary statistics on the differences between respondents who participated in both the first and second survey and respondents who participated in the first but not the second survey. For the observables we investigate, the groups appear to be comparable, although grades are significantly higher in the in-sample group and there are slightly more women in the in-sample group than in the out-of-sample group. Note that we cannot compare the level of Conscientiousness or Emotional Stability between these two groups since this information was only collected in the second survey.

Measuring personality
We follow the non-parametric approach developed by [3] and described in more detail by [4]. This approach uses anchoring vignettes to adjust for respondents' scale use. In this way, it is possible to recode variables such that respondents' answers are on the same scale.
We first measure personality of a respondent. For Conscientiousness, we pose the statement: "I am always prepared." For Emotional Stability, we pose the statement "I get stressed out easily." For Risk Preference, we ask: "Are you generally willing to take risks, or do you try to avoid risks?" Response categories on all statements and questions range from 1 "I fully disagree" to 7 "I fully agree." The question we use to measure Risk Preference is taken from the German Socio-Economic Panel and validated by [17]. They show that responses to the survey question predict behavior in incentivized choices under risk. The items we use to measure Conscientiousness and Emotional Stability are taken from standard questionnaires to measure Big Five constructs [18]. In personality psychology, it is common to use more than one question to measure a trait. However, it is important to assess the differences in scale use for personality items separately. Different items may induce different response mechanisms. The wording of items can evoke different interpretations between groups, for example. Grouping items together to reduce measurement error presupposes a common error structure (e.g., all measurement error is independent) of the items. In this paper, we argue the opposite and focus on the item-level to analyze scale use that is specific to the item. As such, we need a vignette for each question we pose. Due to space limitations in the survey, we can therefore only include one item to measure the trait.
Secondly, we ask the same questions, but for fictive persons. For instance, for Conscientiousness, we state the following: "Imagine a surgeon. To what extent does the following statement apply: A surgeon is always prepared." We have two vignettes for this item of Conscientiousness: "A hairdresser/surgeon is always prepared." For Emotional Stability, we have two vignettes on one item of the trait: "A hairdresser/surgeon gets stressed out easily." For Risk Preference, we have three vignettes: "Is a bus driver/firefighter/hairdresser generally willing to take risks, or does (s)he try to avoid risks?" We then use the answers for the fictive persons as anchoring vignettes for the answers on the subjective question about oneself. The idea is that the way in which subjects report about their own latent trait coincides with the way they report about the latent trait of a "generic" other (e.g., a hairdresser). For example, individuals may differ in their interpretation of the trait or answer categories as described in the self-assessment. As the vignette contains the same description of the trait, the vignette presumably captures the same scale use as the self-assessment. In addition, the tendency to agree with statements independent of their content (i.e., acquiescence bias) is captured by both types of assessments. Moreover, the use of vignettes allows us to correct for reference bias. In particular, an individual may evaluate his or her personality in comparison to a reference group (e.g., those who are close, like friends, family and colleagues) such that the response portrays an individual's stance within the reference group. If personality traits in the group are correlated, then someone might report a low score even though they have a high value at the population level. By introducing a vignette, such errors are corrected.
To do so, we assume a logical order of the vignettes. In the case of Conscientiousness, we assume that surgeons are more conscientious than hairdressers. For the respondents who rated the vignettes in this logical order, we can correct the personality trait by recoding the variable in the following way:
$$C = \begin{cases} 1 & \text{if } y < v_1 \\ 2 & \text{if } y = v_1 \\ 3 & \text{if } v_1 < y < v_2 \\ 4 & \text{if } v_1 < y = v_2 \\ 5 & \text{if } y > v_2 \end{cases}$$
where $y$ is the respondent's self-assessment on the personality trait, $v_1$ is the response of the same respondent for a fictive hairdresser and $v_2$ is the response for the fictive surgeon. For Emotional Stability, we assume that a surgeon is emotionally more stable than a hairdresser. For Risk Preference, we assume that a firefighter is more willing to take risks than a bus driver.
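The recoding rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `recode_ordered` and the example ratings are assumptions made here, and the function deliberately handles only respondents who rate the two vignettes in the assumed logical order.

```python
def recode_ordered(y: int, v1: int, v2: int) -> int:
    """Vignette-corrected value C for a respondent who rates v1 < v2.

    y  : self-assessment (1-7 scale)
    v1 : rating of the lower vignette (e.g. the hairdresser)
    v2 : rating of the higher vignette (e.g. the surgeon)
    """
    if not v1 < v2:
        raise ValueError("rule applies only when the vignettes are rated v1 < v2")
    if y < v1:
        return 1
    if y == v1:
        return 2
    if y < v2:      # at this point v1 < y < v2
        return 3
    if y == v2:
        return 4
    return 5        # y > v2


# Two hypothetical respondents who both answer y = 5 receive different
# corrected values because they place the vignettes differently on the scale.
print(recode_ordered(5, 3, 6))  # 3: between hairdresser and surgeon
print(recode_ordered(5, 5, 7))  # 2: same rating as the hairdresser
```

The example shows the point of the correction: identical raw answers can map to different positions once each respondent's own use of the scale is taken into account.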
In principle, it is an interesting question whether surgeons are observed to be on average more conscientious/emotionally stable. However, for our paper, this is not very important. Crucial for our analyses is that students perceive surgeons to be more conscientious/emotionally stable than hairdressers, or at least they perceive themselves to be on a different level of Conscientiousness/Emotional Stability than those occupations, because the vignettes are mainly about putting everyone's answers on the same scale. The following percentage of the students indeed do perceive the surgeons to be more conscientious/emotionally stable than the hairdressers: 79.6% of the sample (Conscientiousness) and 56.2% (Emotional Stability). When we allow for a different ordering, these percentages increase to: 94.8% of the sample (Conscientiousness) and 86.0% (Emotional Stability).
Perceptions of the vignettes may differ depending on the occupation of the respondent. For example, a hairdresser might perceive a hairdresser's Conscientiousness differently than a surgeon would. These influences may be somewhat weaker in our sample, as the respondents are economics students. Firstly, this means that they are very much alike, such that they would all suffer the same "misperception." Differences between students can then still be interpreted as differences in scale use. Secondly, because we focus on first-year students of Economics and Business, these respondents are not in an occupation yet, besides some small jobs on the side, which are very unlikely to include hairdressing, let alone work as a surgeon. Therefore, we expect that the vast majority of the students will indeed consider service occupations and medical occupations in general.
Respondents do not always follow the intended rank of the vignettes: i.e., either they tie vignettes (e.g., hairdressers are equally conscientious as surgeons) or they misplace them (e.g., surgeons are less conscientious than hairdressers). This leads to the 13 options in Table 1. For Risk Preference, we show two sets of two vignettes in this table: (A) bus driver and firefighter; and (B) hairdresser and firefighter. Alternatively, we can use all three vignettes, but we do not show this in Table 1 since using three vignettes yields 75 ordering options.
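The counts of 13 and 75 ordering options can be checked by enumerating every way of ranking the self-assessment and the vignettes when ties are allowed. The sketch below does this by brute force; the helper name `count_orderings` is an assumption introduced for illustration.

```python
from itertools import product


def count_orderings(n_items: int) -> int:
    """Count distinct rank patterns (ties allowed) among n_items ratings."""
    patterns = set()
    # Ratings from 1..n_items are enough to realise every ordering with ties.
    for ratings in product(range(1, n_items + 1), repeat=n_items):
        # Reduce each rating vector to dense ranks, e.g. (3, 1, 3) -> (1, 0, 1).
        levels = sorted(set(ratings))
        patterns.add(tuple(levels.index(r) for r in ratings))
    return len(patterns)


print(count_orderings(3))  # 13: self-assessment plus two vignettes
print(count_orderings(4))  # 75: self-assessment plus three vignettes
```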
The bottom rows of the table show how many respondents order vignettes in a logical way. Using the strictest definition of logical order, 80.6% of the respondents ordered the vignettes for the Conscientiousness item "Always prepared" in a logical way. For Risk Preference, both ... A and B). For Emotional Stability, we find that fewer respondents logically order their answers. A reason may be that some respondents do not think about a person when answering the questions but about a profession. For instance, they may think of surgeons as working in a stressful environment rather than as being more stress-resistant. This rescaling technique uses the ordering of the answer relative to the vignettes, which is more objective than the answers on the subjective question. The anchoring corrected personality trait has 5 values: the value 1 if row 1 applies, 2 if row 2 applies, etc.
Note to Table 1: Percentage "completely correct ordering" is the share of the sum of rows 1-5 relative to the total number of respondents. Percentage "without intervals" is the share of the sum of rows 1-5, 6, 8, 9, and 13. Row 7 shows the case when respondents rate themselves the same as both vignettes. The only thing one can conclude from this is that the new rating of the respondent will definitely not be lower (1) or higher (5) than both vignettes. It is, however, impossible to distinguish between the values 2, 3 and 4, because we cannot put the own value lower or higher than any of the vignettes. In the case of row 10, we only know that the respondents have rated themselves lower than vignette 1, and the same as vignette 2. This only excludes the value 5. Row 12 is the mirror image of this. Row 11 is the option which gives no information at all because there are no ties and the ordering of the vignettes is wrong.
Besides the most logical ordering of the first five rows, rows 6, 8, 9, and 13 may also be plausible answers since the own value and the vignettes are also separable in these rows. In rows 6 and 8, respondents rate the two vignettes the same, but because the own value is below (above) both vignettes, one can still conclude that the own value is lower (higher) than both vignettes. Therefore, for these rows, we can set the anchoring corrected variable to 1 (5). In rows 9 and 13, the vignettes have been reversed by the respondent, but because the own value is lower (higher) than both vignettes, we can likewise rescale the answer to 1 (5).
Using this additional insight, we define two samples. The baseline sample contains only respondents with answers 1-5, and the extended sample additionally includes rows 6, 8, 9, and 13. All other answers are less plausible and will be excluded from the analyses.
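The sample definitions can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the authors' code: the function names and labels are hypothetical, and the interval logic follows the verbal description of tied and reversed vignettes given above, in the spirit of the non-parametric approach of [3]. A respondent is placed in the baseline sample if the vignettes are rated in the logical order, added in the extended sample if the answers still pin down a single corrected value, and excluded otherwise.

```python
from typing import Optional, Tuple


def corrected_interval(y: int, v1: int, v2: int) -> Tuple[int, int]:
    """Lowest and highest corrected value consistent with the three answers."""
    satisfied = []
    if y < v1:
        satisfied.append(1)
    if y == v1:
        satisfied.append(2)
    if v1 < y < v2:
        satisfied.append(3)
    if y == v2:
        satisfied.append(4)
    if y > v2:
        satisfied.append(5)
    return min(satisfied), max(satisfied)


def classify(y: int, v1: int, v2: int) -> Tuple[str, Optional[int]]:
    """Return (sample label, corrected value or None if not identified)."""
    low, high = corrected_interval(y, v1, v2)
    if v1 < v2:
        return "baseline", low            # logical order: single value 1-5
    if low == high:
        return "extended_only", low       # tied or reversed, but still 1 or 5
    return "excluded", None               # only an interval is known


print(classify(5, 3, 6))  # ('baseline', 3)
print(classify(2, 4, 4))  # ('extended_only', 1): vignettes tied, own value below
print(classify(6, 5, 3))  # ('extended_only', 5): vignettes reversed, own value above
print(classify(4, 4, 4))  # ('excluded', None): interval 2-4
print(classify(4, 5, 3))  # ('excluded', None): interval 1-5, no information
```

In this sketch the extended sample corresponds to the union of the `baseline` and `extended_only` labels.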
Tables B-D in S1 File give summary statistics of the samples for Conscientiousness, Emotional Stability, and Risk Preference, respectively.

Subjective and anchoring vignette corrected measures of personality
We first estimate the following type of function:
$$P_i = \delta' X_i + \varepsilon_i$$
in which $P_i$ is the personality trait (Conscientiousness, Emotional Stability, or Risk Preference) of individual $i$. $X_i$ indicates a set of explanatory variables: math major, math entry test score, female, age, parental education, nationality (Dutch (reference group), Belgian, German, other nationality), and study field (Economics and Business (reference group), Fiscal Economics, International Business). The relationship of these variables with personality traits is captured by the vector $\delta'$. The error term is denoted as $\varepsilon_i$.
We run this type of regression separately for subjective personality traits and for anchoring vignette corrected personality measures. We do this both for the baseline and the extended sample. We make use of ordered probit models due to the ordinal nature of the dependent variable. The results indicate whether relationships between subjective personality and explanatory variables are robust when using the anchoring vignette corrected personality measures.
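For reference, a textbook ordered probit specification for the regression of a personality measure on the explanatory variables can be written as follows. The latent variable $P_i^{*}$, the thresholds $\tau_k$, and the number of categories $K$ are notation introduced here for illustration; they are not taken from the original text, and the exact specification the authors estimated may differ.

```latex
P_i^{*} = \delta' X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, 1),
\\
P_i = k \quad \text{if} \quad \tau_{k-1} < P_i^{*} \le \tau_k,
\qquad \tau_0 = -\infty, \; \tau_K = +\infty,
\\
\Pr(P_i = k \mid X_i) = \Phi(\tau_k - \delta' X_i) - \Phi(\tau_{k-1} - \delta' X_i),
\qquad k = 1, \dots, K.
```

Under this reading, $K = 7$ for the uncorrected answers on the 1-7 response scale and $K = 5$ for the vignette-corrected measure, and $\Phi$ is the standard normal distribution function.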

Relation between grade and personality
Our main analyses are given by the function:
$$G_i = \beta' P_i + \alpha' X_i + \mu_i$$
in which $G_i$ is the grade obtained in the quantitative methods course. $\beta'$ measures the effect of personality traits on this grade. The relationship of background variables with the grade is captured by the vector $\alpha'$. The error term is denoted as $\mu_i$. Similar to the first equation, we estimate this function separately for the three personality traits, which are in turn measured subjectively and anchoring vignette corrected, and we do this both for the baseline and the extended sample. We use OLS regressions with fixed effects for the student's study program. Grades range from 1 to 10 in steps of 0.5 grade points. With 19 possible values, we consider OLS best suited in this case. The results indicate whether relationships between grades and subjective personality measures are robust when using the anchoring vignette corrected personality measures.
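A stripped-down version of this grade regression can be sketched in pure Python. The sketch rests on several assumptions that are not in the original: the data are made up and noise-free, the background variables $X_i$ are omitted, the program labels `EB`, `FE` and `IB` are hypothetical abbreviations, and the trait is standardized with the population standard deviation. Study-program fixed effects are handled with the within transformation, which gives the same slope as OLS with one dummy per program.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def within_slope(x, y, groups):
    """OLS slope of y on x with a separate intercept (fixed effect) per group.

    Demeans x and y inside each group, then regresses demeaned y on demeaned x.
    """
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    x_dm, y_dm = [0.0] * len(x), [0.0] * len(y)
    for idx in by_group.values():
        mx = mean(x[i] for i in idx)
        my = mean(y[i] for i in idx)
        for i in idx:
            x_dm[i], y_dm[i] = x[i] - mx, y[i] - my
    return sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)


# Made-up data: corrected trait (1-5), study program, and a grade built as
# program-specific intercept + 0.5 * standardized trait (no noise).
trait = [1, 2, 3, 4, 5, 2, 3, 4, 5, 1]
program = ["EB", "EB", "EB", "EB", "FE", "FE", "FE", "IB", "IB", "IB"]
intercept = {"EB": 6.0, "FE": 6.5, "IB": 7.0}

trait_z = standardize(trait)
grade = [intercept[p] + 0.5 * z for p, z in zip(program, trait_z)]

print(round(within_slope(trait_z, grade, program), 6))  # recovers 0.5
```

Because the trait is standardized, the slope reads as the grade difference associated with a one standard deviation higher trait score, which is how the coefficients in the results are interpreted.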

Results
The first part of the empirical results shows whether the student characteristics are related differently to corrected and uncorrected personality measures. Tables 2-4 give the results of two regressions: one between the uncorrected personality measures and the characteristics $X_i$, and another between the corrected personality measures and $X_i$. By doing so, we follow [3] and [9], who show that background characteristics explain both the latent trait and scale use. It follows that scale use can differ structurally between individuals.
In the tables, column 1 shows the results for the full sample. We focus on comparing results of corrected and uncorrected measures of the personality traits in the baseline (columns 2 and 3) and the extended sample (columns 4 and 5). Regarding the results for Emotional Stability (Table 3), for instance, column 2 shows that Belgian students report lower scores of Emotional Stability than Dutch students in the baseline sample (given other characteristics). However, column 3 reveals that this relationship is not robust when using the corrected measure for Emotional Stability. In the extended sample, we also find that this relationship is less strong when using the corrected measure than when we use the uncorrected measure. These results are important as they reveal that people differ in their scale use systematically. For example, nationality may affect how you interpret the scale or the text that describes traits and behaviors. If such systematic differences exist, it becomes questionable whether the relationship between uncorrected personality measures and outcomes such as academic performance represents the true underlying relationship. Academic performance in itself may alter one's scale use, for example, such that the linear relation is under- or overestimated. Moreover, any other covariates used to predict academic performance may yield biased estimates as they covary both with the latent trait and scale use. In sum, showing that people differ in scale use depending on their background characteristics provides evidence that using uncorrected measures as predictors may be unwarranted.
Tables 5-7 report the relationships between grades and personality traits. For instance, Table 7, column 2, shows that willingness to take risks appears to be strongly related to grades. One standard deviation higher willingness to take risks is related to a grade around 0.26 points lower on a scale of 1-10. However, column 3 reveals that the relationship is no longer significant when using the corrected measure of Risk Preference. Therefore, there is a large overestimation (in an absolute sense) of the relationship between Risk Preference and grades if the vignette is not used. These results also hold if we use the extended sample. Tables 5 and 6 show the results for Conscientiousness and Emotional Stability. Comparing columns 2 and 3 in these tables reveals a pattern similar to the one we find for Risk Preference; the point estimates of the relationships between these personality traits and grades are much larger if the vignette is not used than if it is used.
Note to Table 3: Each column reports a regression with Emotional Stability as the dependent variable (standardized to a mean of zero and standard deviation 1) and the variables in the rows as the independent variables. The baseline sample contains only respondents with answers 1-5 of Table 1, and the extended sample additionally includes rows 6, 8, 9, and 13. Standard errors are reported in parentheses. *** p<0.001, ** p<0.01.
In this analysis, we allow "incorrect" orderings by students to be included. In Tables 5-7 we compare (a) the original Conscientiousness/Emotional Stability/Risk Preference measure for all respondents, (b) the original Conscientiousness/Emotional Stability/Risk Preference measure for those respondents with the correct ordering on the vignette (Conscientiousness: 79.6% of the sample; Emotional Stability: 56.2%; Risk Preference: 94.0%), (c) the new vignette-adjusted measure of Conscientiousness/Emotional Stability/Risk Preference for those respondents with the correct ordering on the vignette (same proportion as (b)), (d) the original Conscientiousness/Emotional Stability/Risk Preference measure for the respondents from (b) plus those in rows 6, 8, 9, and 13 of Table 1 (the extended sample), and (e) the vignette-adjusted measure for the extended sample. The results remain robust independent of the usage of the baseline sample or the extended sample, with the exception of Conscientiousness, for which the coefficient becomes somewhat larger and the p-value falls below 0.05 (reassuringly, the two coefficients are not statistically different from each other).

Conclusions
This paper investigates the relationship between student performance in higher education and Conscientiousness, Emotional Stability and Risk Preference. We employ anchoring vignettes to correct for heterogeneous scale use in these non-cognitive skills. Our main result is that if scale use is not taken into account, the relationship between academic achievement and personality traits is overestimated. This holds for all traits we investigated. Our results have important implications for the literature studying the relationship between personality and outcomes such as school achievement. Previous research has shown strong correlations between personality and such outcomes, but our paper suggests that these correlations may be overestimated. Using vignettes or more objective personality measures is important to get unbiased estimates of the predictive power of personality for outcomes in life.
One of our research limitations is that our sample of college students is not representative of the full Dutch population. Future research is needed to see if the results are robust in representative samples. A second limitation is that the anchors we use may in themselves also be subject to bias. For instance, students who have hairdressers in their direct environment may judge hairdressers' personality traits to be different from students who do not have hairdressers in their direct environment. Moreover, having a hairdresser in the direct environment may be related to their academic outcomes. In this case, the relationship between corrected traits and academic outcomes may be biased. We control for parental education levels so this issue may not lead to much bias in our estimations, but the example does point out that future work needs to focus on finding anchors which are more objective than ours and are not susceptible to structural differences in vignette perception. Finding the perfect anchors has not been our focus. Our work, instead, serves as a first step in investigating the extent to which using more objective measures of personality traits influences the estimated relationships between traits and outcomes.
Note to Table 6: Each column reports a regression with grades as the dependent variable and the variables in the rows as the independent variables. Emotional Stability (the reverse of "I get stressed out easily") is standardized to a mean of zero and standard deviation 1. The baseline sample contains only respondents with answers 1-5 of Table 1, and the extended sample additionally includes rows 6, 8, 9, and 13.

Author Contributions
Conceptualization: Johan Coenen, Bart H. H. Golsteyn, Tom Stolp.
Note to Table 7: Each column reports a regression with grades as the dependent variable and the variables in the rows as the independent variables. Risk Preference ("Are you generally willing to take risks, or do you try to avoid risks?") is standardized to a mean of zero and standard deviation 1. The baseline sample contains only respondents with answers 1-5 of