Cross-cultural examination of the Big Five Personality Trait Short Questionnaire: Measurement invariance testing and associations with mental health

The present study examined the measurement invariance of the Big Five Personality Trait Short Questionnaire (BFPTSQ) across language (Spanish and English), Spanish-speaking country of origin (Argentina and Spain) and gender groups (female and male). Evidence of criterion-related validity was examined via associations (i.e., correlations) between the BFPTSQ domains and a wide variety of mental health outcomes. College students (n = 2158) from the USA (n = 1117 [63.21% female]), Argentina (n = 353 [65.72% female]) and Spain (n = 688 [66.86% female]) completed an online survey. Of the tested models, an Exploratory Structural Equation Model (ESEM) fit the data best. Multigroup ESEM and ESEM-within-CFA generally supported the measurement invariance of the questionnaire across groups. Internalizing symptomatology, rumination and low happiness were related mainly to low emotional stability across countries, while low agreeableness and low conscientiousness were related chiefly to externalizing symptomatology (i.e., antisocial behavior and drug outcomes). Some correlational differences arose across countries and are discussed. Our findings generally support the BFPTSQ as an adequate measure to assess the Big Five personality domains in Spanish- and English-speaking young adults.


Introduction
According to a biopsychosocial model of psychopathology, several biological, psychological and social variables have been shown to affect health outcomes [1]. One of the nonspecific distal psychological variables that influences psychopathology development is personality [2]. Personality traits have also been associated with other outcomes, such as happiness [3], academic and job performance, antisocial and criminal conduct [4], and a broad spectrum of health-related behaviors [5,6].
The Five-Factor Model (FFM; a.k.a. Big Five) is one of the most widely accepted structural personality models [7]. The FFM proposes five broad personality traits: openness to experience, extraversion, agreeableness, conscientiousness, and neuroticism (or its positive pole, emotional stability). Openness represents individual differences in curiosity, fantasy, appreciation of art and beauty, and social attitudes. Extraversion reflects individual differences in sociability, social ascendency, activity, excitement seeking, and positive emotionality. Agreeableness reveals individual differences in compliance, empathy, collaboration, and altruism. Conscientiousness represents individual differences in being methodical, planning, impulse control, and respecting and abiding by conventional social norms and rules. Neuroticism refers to individual differences in the tendency to experience frequently and intensively negative emotions, such as anxiety, fear, depression and irritability, as well as having low self-esteem [8].
Traditionally, measures to assess personality traits encompass many items and are, therefore, time-consuming. Several brief versions [e.g., 8] have been developed. These brief versions are particularly useful when there is limited administration time and/or when the target population's characteristics (e.g., adolescents, elderly) impede the use of full versions.
Among these short personality measures, the BFPTSQ [8] has a number of advantages. First, it has wider conceptual breadth (i.e., content validity) than most available measures, particularly very short ones. Indeed, many available short personality measures suffer from limited conceptual breadth, which essentially means that these measures do not represent a number of important lower-order or primary traits [see 16]. When developing the BFPTSQ, Morizot built it from the initial pool of the English BFI items [7,10], but added 8 new items to tap into important primary traits that were missing in the original short measures (e.g., sensation seeking, impulsiveness, openness to cultural differences, etc.). Further, 2 items that could generate confusion were deleted ("prefer work that is routine" and "generates a lot of enthusiasm"). Second, 36 items were reworded to be more easily understood than in the original BFI, so they may be utilized in assessments with both adolescents and adults (S1 Table presents the item correspondence between the BFI and the BFPTSQ). This is particularly important for long-term longitudinal studies because they often employ different measures for adolescents and adults depending on the participants' ages at the time points when assessments are conducted. In such cases, determining whether differences are due to real changes in traits or to the measure taken to assess personality during different developmental periods is no easy task. Finally, this instrument is in the public domain, so it can be used freely by researchers for applied purposes.
Originally, the psychometric properties of the BFPTSQ were examined in French-speaking adolescents from Quebec [8]. The French BFPTSQ scores showed adequate psychometric properties, including evidence for content and structure validity and adequate internal consistency [8]. The BFPTSQ scores also correlated with the NEO-PI-3 [14], supporting its convergent validity. Finally, criterion validity was demonstrated by predicting psychopathology symptoms (i.e., conduct disorder, major depression disorder, attention deficit hyperactivity disorder, bipolar disorder, oppositional defiant disorder, social phobia, substance use and generalized anxiety disorder) and academic achievement (i.e., grade point average). Moreover, this version was found to be invariant across gender groups [8].
The Spanish BFPTSQ was adapted and validated in a sample of Spanish adults [17]. Findings supported not only the structure reported by Morizot [8], but also the criterion-related validity (e.g., correlations of emotional stability and extraversion with happiness, and of low conscientiousness and extraversion with alcohol consumption). Notably, these are the only two studies that have examined the psychometric properties of the BFPTSQ. To our knowledge, no previous work has examined the adequacy of the English BFPTSQ version to assess personality traits in English-speaking populations. Interest in understanding human psychology and behaviors outside traditionally studied cultures (i.e., Western populations) has grown steadily [18,19], particularly through cross-cultural research. However, a first key step to conducting cross-cultural research is to demonstrate that a questionnaire works in similar ways (i.e., measurement invariance) across countries, languages or other groups (e.g., gender). Only when measurement invariance is met is it legitimate to make valid comparisons of results across groups. Lack of measurement equivalence can lead to biased conclusions being drawn about potential cross-cultural differences [20].
In addition, although there is currently an increasing demand for short scales, their construction is not free of difficulties [21]. Following the recommendations proposed by Ziegler et al. [21], and taking into account that our research purpose is to assess the Big Five in college/university students from different countries, we explored the factorial validity of the BFPTSQ with rigorous statistical strategies: 1) a structural equation modeling (i.e., ESEM) approach, 2) a calculation of different internal consistency indices (rather than just Cronbach's alpha), and 3) correlational analyses across groups examining the empirical evidence supporting the interpretation of the test scores (i.e., criterion validity).
Specifically, in the present study we: (a) test the BFPTSQ structure in two different Spanish-speaking countries (Argentina and Spain); (b) test the structure of the English BFPTSQ version; (c) test the measurement invariance across countries (Argentina vs. Spain), languages (English vs. Spanish) and gender groups; (d) explore the internal consistency of the scales among groups; and (e) examine the associations among the five BFPTSQ domains and a large number of psychological constructs (i.e., psychopathology, antisocial behavior, marijuana use and negative marijuana-related consequences, rumination and happiness) in college students from the USA, Argentina and Spain (i.e., criterion-related validity). We focused on this set of variables because substance use [22,23] and mental health problems [24-26] are particularly insidious among college students. Therefore, a valid measure like the BFPTSQ will facilitate the cross-cultural examination of personality traits and their associations with a large set of outcomes in college students from different cultures/countries. It will also be useful for identifying college students at more risk of developing substance-related and mental health problems.

Participants and procedure
College students from one university in Spain, one university in Argentina, and two universities in the USA completed an online survey on personality traits, personal mental health and marijuana use behaviors [for more information see 27]. Although 2192 college students completed the BFPTSQ, only those cases with less than 5% of missing values were retained. After deleting the cases that exceeded this threshold (n = 29) and the five cases who failed to report their gender, the final sample included 2158 undergraduate students. Table 1 presents the descriptive statistics of the three samples.
Before the participants were assessed, the ethics committees of the Universidad de Córdoba and Universitat Jaume I approved the study, as did the Collaborative Institutional Training Initiative (CITI program) in the USA universities (ID: 21636999 and 21637000).

Measures
At all the university sites, the participants were administered the questionnaires below.
Big Five personality traits. Personality traits were assessed with the 50-item Big Five Personality Trait Short Questionnaire [BFPTSQ; 8] at the US universities, and with the Spanish version [17] at the sites in Argentina and Spain. This measure assesses the FFM personality traits (openness, extraversion, agreeableness, conscientiousness and emotional stability) on a 5-point Likert-type scale (0 = Strongly Disagree, 4 = Strongly Agree). In the present study, all the reversed items are indicated with an r after the item number (e.g., 31r). Responses were summed on each of the five scales and divided by the number of items in the scale [10]. Thus, the scale scores in the present study ranged from 0 to 4.
Mental health. Past 2-week psychopathology was assessed using the 23-item DSM-5 Self-Rated Level 1 Cross-Cutting Symptoms Measure-Adult [29]. For Spanish-speaking students, the Spanish version was administered [30]. Participants were asked, "During the past two weeks, how much (or how often) have you been bothered by the following problems?" and responded on a 5-point response scale (0 = none, not at all; 1 = slightly or rarely, less than a day or two; 2 = mild, several days; 3 = moderately, more than half the days; 4 = severely, nearly every day). A score of 2 or higher in most domains, except substance use (score of 1 or higher), is suggestive of clinically-relevant mental health problems [31]. The measure has been validated with both clinical [31] and college student [26] samples.
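The scale scoring rule described above for the BFPTSQ (item responses summed and divided by the number of items) can be illustrated with a minimal sketch. It assumes that reversed items are recoded as 4 minus the raw response before averaging, which the text does not state explicitly; the function name and the example answers are hypothetical, and only the item labels are taken from the questionnaire.

```python
from statistics import mean

MAX_SCORE = 4  # responses range from 0 (Strongly Disagree) to 4 (Strongly Agree)


def scale_score(responses, scale_items):
    """Return the mean score (0 to 4) for one scale.

    responses: dict mapping item labels (e.g. "5r", "10") to raw answers 0-4.
    scale_items: labels belonging to the scale; a trailing "r" marks a reversed item.
    """
    recoded = []
    for label in scale_items:
        raw = responses[label]
        # Assumption: reversed items are recoded as MAX_SCORE - raw.
        recoded.append(MAX_SCORE - raw if label.endswith("r") else raw)
    return mean(recoded)


# Hypothetical answers to four emotional stability items.
answers = {"10": 3, "35": 4, "5r": 1, "15r": 0}
score = scale_score(answers, ["10", "35", "5r", "15r"])
```

Averaging rather than summing keeps every scale on the original 0 to 4 response metric, which makes scale scores directly comparable across the five domains.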
Significant differences in the prevalence rates of the following symptoms between countries were found: depression (US: 29.59%, Arg: 38.53%, Sp: 40.
Antisocial behavior. Antisocial behavior was assessed with the Antisocial Behavior Scale [ABS; 32]. The ABS contains 35 items that describe various antisocial behaviors (e.g., "I have broken, ripped, or damaged public properties" or "I have used knives or sticks in fights") on a 4-point response scale (1 = Never or Almost Never, 4 = Very Frequently or Very Often). Summing the responses to all the items provides a total score. In a previous project, the research team translated the ABS into English and adapted it. The preliminary results revealed that the scores for the Spanish and English ABS versions displayed good internal consistency. Differential item functioning analyses indicated that the items generally operate similarly across the three participating countries [33]. The Cronbach's alpha of the scale in the current total sample was .93, and the values by country were .95 (US), .87 (Argentina) and .92 (Spain).
Negative marijuana-related consequences. The 21-item B-MACQ was employed to assess negative marijuana-related consequences [34]. All the items were scored dichotomously to reflect the absence/presence of each marijuana-related problem in the last month (0 = no, 1 = yes). The total score reflects the number of consequences that individuals experienced in the last 30 days. Previous research supports the test-retest reliability, as well as the discriminant and convergent validity, of the B-MACQ [34], and has also examined its measurement invariance and criterion validity across cultures and languages [e.g., 27]. The Cronbach's alpha of the scale in the current total sample was .87, and the values by country were .89 (US), .81 (Argentina) and .86 (Spain).
Marijuana use. Frequency of marijuana use was assessed by this question: "How many days in the last 30 days have you used marijuana?" If the participants responded 1 or higher, they completed the marijuana quantity measure. To report the consumed amount of marijuana, the participants were shown a visual guide indicating several amounts of marijuana in grams. Their typical weekly marijuana use in the last 30 days was assessed by the Marijuana Use Grid [MUG; 35]. The participants were asked to estimate the amount of marijuana they used in grams during each 4-hour period per day of a typical week. Adding all the values gave an estimate of the total grams of marijuana used in a typical week.
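As a minimal sketch of how the typical weekly quantity described above can be obtained, the grid can be represented as seven days with six 4-hour periods each and summed. The data layout, the function name and the example values are assumptions made for illustration and do not reproduce the MUG administration format.

```python
def weekly_grams(grid):
    """Sum the grams reported in a 7-day x 6-period grid (six 4-hour periods per day)."""
    if len(grid) != 7 or any(len(day) != 6 for day in grid):
        raise ValueError("expected 7 days with 6 four-hour periods each")
    return sum(sum(day) for day in grid)


# Hypothetical week: 0.5 g in the last period of two days, nothing otherwise.
week = [[0.0] * 6 for _ in range(7)]
week[4][5] = 0.5
week[5][5] = 0.5
total = weekly_grams(week)
```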
Rumination. Rumination was measured by the Ruminative Thought Style Questionnaire [RTSQ; 36]. The participants were asked to express how well each item described them on a 7-point response scale (1 = Not at all, 7 = Very Well). In Argentina and Spain, the Spanish RTSQ version was utilized [see the translation and adaptation procedures in 37]. Based on previous findings obtained with the USA, Argentinian and Spanish samples, a 15-item version of this measure was employed, which proved invariant across genders and countries [37]. The Cronbach's alpha of the scale in the current total sample was .94, and the values by country were .95 (US), .94 (Argentina) and .94 (Spain).
Happiness. One question was about general happiness. The participants had to respond about how happy they felt in general that day (by attempting to ignore any feelings they had yesterday) on a 10-point scale (1 = Completely Unhappy to 10 = Completely Happy).

Statistical analysis
All the analyses were done with version 25 of SPSS and version 7.4 of Mplus [38]. The robust maximum likelihood estimator (MLR) was used in each analysis conducted in Mplus. The MLR provides adjusted standard errors and statistical fit tests that are robust to data non-normality. The 99% confidence intervals (CI) of the relevant estimates were calculated and reported. Two model types were employed to assess factor validity: the ICM-CFA (independent clusters model confirmatory factor analysis) and the ESEM (exploratory structural equation modeling) with target loading rotation.
In line with Marsh et al. [39] and Morizot [8], all the factor models were estimated both with and without a priori correlated uniquenesses. These were included to reflect the fact that some items relate to the same primary trait or subdomain, share similar content with reversed scoring, or contain the same word. Twenty-seven a priori correlated uniquenesses were posited. Specifically, the correlated uniquenesses introduced for openness were: and for emotional stability: 10 with 35, 10 with 15r, 5r with 25, 5r with 45r, 30r with 50r. A detailed description of the conducted ESEM and ICM-CFA models can be found in Morizot [8].
The model fit assessment was made according to various indices [40]. The chi-square test was run for all the models. Although a nonsignificant chi-square indicates a good fitting model, this test is generally too sensitive with large sample sizes. Thus, other fit indices were calculated. Values of .08 or lower for the root mean square error of approximation (RMSEA), values of .90 or more for the comparative fit index (CFI) and Tucker-Lewis index (TLI) and values of .10 or less for the standardized root mean square residual (SRMR) suggest acceptable model fit [41,42]. For the RMSEA 90% CI values, those under .05 for the lower bound and under .08 for the upper bound indicate acceptable fit [43].
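The cutoffs listed above can be summarized as a simple checklist. The sketch below is only an illustration of those thresholds; the function name and the example values are hypothetical and are not taken from Table 2.

```python
def acceptable_fit(rmsea, cfi, tli, srmr, rmsea_ci=None):
    """Return pass/fail flags for the acceptable-fit cutoffs cited in the text."""
    checks = {
        "RMSEA <= .08": rmsea <= 0.08,
        "CFI >= .90": cfi >= 0.90,
        "TLI >= .90": tli >= 0.90,
        "SRMR <= .10": srmr <= 0.10,
    }
    if rmsea_ci is not None:
        lower, upper = rmsea_ci
        # Lower bound under .05 and upper bound under .08.
        checks["RMSEA 90% CI"] = lower < 0.05 and upper < 0.08
    return checks


# Hypothetical fit indices for one model.
result = acceptable_fit(rmsea=0.03, cfi=0.91, tli=0.88, srmr=0.03,
                        rmsea_ci=(0.028, 0.032))
```

Returning one flag per index, rather than a single verdict, mirrors how mixed results are reported in practice (e.g., an adequate RMSEA and SRMR alongside a CFI or TLI slightly below .90).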
After identifying the best factor model, factor structure was tested in each country, and measurement invariance was tested between the Spanish-speaking groups (Argentina and Spain), the Spanish and English versions, and across gender groups using multi-group ESEM. These models were assessed with a series of increasingly stringent multiple-group models (see [8]): configural invariance (MG1; all the loadings, intercepts and uniquenesses are freely estimated, with latent variances being constrained to 1 and latent means to 0), metric invariance (MG2; loadings constrained to invariance, which allows the factor variances to be freely estimated in one group), scalar invariance (MG3; intercepts constrained to invariance, which allows the factor means to be freely estimated in one group), strict invariance (MG4; uniquenesses constrained to equality), correlated uniquenesses invariance (MG5), variance/covariance invariance (MG6; variances and covariances must be constrained simultaneously in ESEM), and latent means invariance (MG7). For all the models in this sequence, the imposed constraints are additive and the preceding model acts as a reference.
If there was evidence for noninvariance of the factor loadings across groups, an ESEM-within-CFA (ES-W-C) multigroup model was utilized, as partial factor loading invariance cannot be tested in ESEM [44,45]. For the ES-W-C model, all parameter estimates from the ESEM solution were used as starting values. In addition, we added a total of 25 constraints (the square of the number of factors) to the ES-W-C model so that it was identified. Specifically, the 5 factor variances for the first group of the multiple group solution and the 20 cross-loadings of the five "anchor items" (four per item) were fixed. The anchor item or referent indicator for each factor is the item that has a large loading for the factor that it is designed to measure and small cross-loadings on other factors. These small cross-loadings were fixed to their values from the ESEM solution. This allowed a higher level of convergence with the ESEM solution. For all other parameter estimates, the patterns of the fixed and free estimates were the same as in the selected ESEM solution [44]. It is noteworthy that, in ES-W-C, the factor variances were fixed to one in the first group to identify the model. Therefore, the invariance of the covariances across groups was tested, rather than the variance/covariance invariance.
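The count of 25 identification constraints can be checked with a short sketch. It reads the rule above as fixing one variance per factor plus, for each factor's anchor item, its cross-loadings on the remaining factors; this reading and the function name are assumptions made for illustration.

```python
def es_w_c_constraints(n_factors):
    """Count the identification constraints for an ESEM-within-CFA model."""
    # One factor variance fixed per factor (in the first group).
    fixed_variances = n_factors
    # One anchor item per factor, with its cross-loadings on the other factors fixed.
    fixed_cross_loadings = n_factors * (n_factors - 1)
    return fixed_variances, fixed_cross_loadings, fixed_variances + fixed_cross_loadings


variances, cross_loadings, total = es_w_c_constraints(5)
```

For five factors this gives 5 + 20 = 25, which equals the square of the number of factors.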
To assess changes in model fit, the Satorra-Bentler scaled chi-square difference test [46] was computed. However, the chi-square difference test is sensitive to sample size [47]. For this reason, changes in other fit indices were also compared between less and more constrained models. In order to consider a model to be invariant, the ΔCFI should be ≤ .010 and the ΔRMSEA should be ≤ .015 [48,49].
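The decision rule above can be expressed as a small helper. The sketch assumes that the criteria refer to a decrease in CFI of no more than .010 and an increase in RMSEA of no more than .015 when moving from the reference model to the more constrained one; the function name and the absolute fit values in the example are hypothetical.

```python
def invariance_supported(cfi_ref, cfi_constrained, rmsea_ref, rmsea_constrained,
                         max_cfi_drop=0.010, max_rmsea_rise=0.015):
    """Compare a constrained model with its reference model.

    Returns (delta_cfi, delta_rmsea, supported), with deltas rounded to 3 decimals.
    """
    delta_cfi = cfi_constrained - cfi_ref
    delta_rmsea = rmsea_constrained - rmsea_ref
    tolerance = 1e-9  # guards against floating-point noise at the cutoff
    supported = (-delta_cfi <= max_cfi_drop + tolerance
                 and delta_rmsea <= max_rmsea_rise + tolerance)
    return round(delta_cfi, 3), round(delta_rmsea, 3), supported


# Hypothetical fit values in which CFI drops by .017 and RMSEA rises by .001.
rejected = invariance_supported(0.920, 0.903, 0.030, 0.031)
# Hypothetical fit values in which CFI drops by only .005.
retained = invariance_supported(0.920, 0.915, 0.030, 0.031)
```

Under this rule, a model is flagged as noninvariant as soon as either index exceeds its cutoff, which is when a partially invariant model would be estimated.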
In both the Spanish and English questionnaire versions, reliability was examined with Cronbach's alphas and ordinal omegas [50]. Evidence for criterion validity was explored with Pearson correlations between all the personality dimensions and psychopathology, antisocial behavior, marijuana outcomes, rumination and happiness in all three countries.
Table 1 provides the descriptive statistics (means/standard deviations) for all the personality dimensions, criterion variables and the participants' ages for the whole sample and per country. The comparison of the magnitude of the mean differences across countries indicated that, despite medium (USA and Spain; Spain and Argentina) and large (USA and Argentina) differences in the participants' ages across countries, all the differences in personality and the criterion variables were small (all the ds were below .50).
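For readers less familiar with the internal consistency index mentioned above, Cronbach's alpha can be computed from raw item responses with the standard formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of the total score). The sketch below is a generic illustration with made-up responses; it is not the SPSS or Mplus routine used in the study, and it does not cover ordinal omega.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item responses."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # one tuple of responses per item
    sum_item_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Made-up responses of four respondents to three items scored 0 to 4.
data = [
    [0, 1, 1],
    [2, 2, 3],
    [3, 3, 4],
    [4, 4, 4],
]
alpha = cronbach_alpha(data)
```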

Factor structure
When studying the BFPTSQ structure in the total sample, the best fitting model was the ESEM model in which correlated uniquenesses were allowed (M2b). Table 2 presents the fit indices of all the tested models. Table 3 reports the standardized factor loadings for the whole sample. All the items had significant factor loadings on their hypothesized factor, except for items 31r ("Is not really interested in different cultures, their customs and values") and 41r ("has few artistic interests") on the openness factor. All the items for the conscientiousness and emotional stability factors presented the highest factor loading on their intended factor. Eight extraversion items showed the highest factor loadings on their hypothesized factor, while items 42 ("likes exciting activities that provide thrills") and 47 ("has a tendency to laugh and have fun easily") showed the highest factor loadings on the openness factor. Five agreeableness items had the highest factor loadings on their intended factor (items 3r, 13r, 28r, 38r, 48r), and five items showed similar cross-loadings between the agreeableness and openness factors (items 8, 18, 23, 33, 43). Table 4 shows the latent factor correlations from the final ESEM and the ICM-CFA.
Upon finding the best factor solution, an ESEM was performed in each country. Fit indices were acceptable for the Spanish sample. In the Argentinian sample, the CFI and TLI were close to, but lower than, .90. However, the RMSEA, RMSEA 90% CI values and SRMR were adequate (≤ .05). In both samples, factor loadings were salient and significant on their hypothesized factor, except for items 31 and 41 on the openness to experience factor (see S2 Table). All the items for conscientiousness, agreeableness and emotional stability presented the highest factor loading on their intended factor, while items 27 ("show self-confidence, is able to assert himself/herself") and 42 ("likes exciting activities that provide thrills") from the extraversion factor showed similar cross-loadings on the extraversion and openness to experience factors. The fit indices of the English version were adequate. The factor loadings of the English version are presented in Table 3 (as they are the same as those obtained in the configural invariance model across the English and Spanish versions), and they were very similar to those found in the whole sample.
Table 3 note. The column "whole sample" presents the factor loadings of the M2b model, while the columns "English" and "Spanish" correspond to the factor loadings of the MG1 model of the invariance between the English vs. Spanish versions. The fit indices of both models are presented in Table 2. Bold denotes all the significant factor loadings (the 99% CI does not cross zero).

Measurement invariance
A few minor differences emerged across groups when studying the invariance of the Spanish questionnaire version between the Argentinian and Spanish participants. Constraining the intercepts across Spanish speakers resulted in a ΔRMSEA below .015, but ΔCFI was -.017 (MG3). Hence a model with partial invariance of intercepts (MG3b) was estimated. Based on the modification indices, the intercepts of four items were freed across groups: 13r ("provokes quarrels or arguments with others"), 38r ("can sometimes be rude or mean to others") and 43 ("likes to cooperate with others") (Arg > Sp) from the agreeableness factor, and 49r ("can do things impulsively without thinking about the consequences") (Arg > Sp) from the conscientiousness factor. This model gave a better fit than the model with the fully invariant intercepts and ΔCFI ≤ .01. When further constraints were included (MG4 to MG7), ΔCFI was ≤ .01 and ΔRMSEA was ≤ .015, which suggested reasonable invariance across groups. Considering that the structure of the Spanish BFPTSQ had been previously studied [17], and that only small differences across Spanish-speaking samples had been found, the Argentinian and Spanish samples were considered together when the structure of the Spanish version was compared with the English version.
To test the measurement invariance of the English and Spanish (Spanish and Argentinian combined samples) versions, a configural invariance model was estimated (MG1). This model showed acceptable fit indices, as can be seen in Table 2. Its factor loadings are presented in Table 3. When the factor loadings were constrained across Spanish and English speakers, ΔRMSEA was .001 and ΔCFI was -.016 (MG2). Therefore, an ES-W-C was run to test the partial metric invariance (MG2b). According to the modification indices, six factor loadings (6 of 250) were freely estimated across groups. One was a difference in the nonstandardized factor loading of one item on its target factor: 20r ("worries a lot about many things") on the emotional stability factor (Spanish = .454 [.361, .546]; English = .772 [.668, .876]). The others were differences in cross-loadings. Adding constraints between the intercepts across groups also indicated differences (MG3, ΔCFI = -.025). Thus, a model with partial invariance of intercepts (MG3b) was estimated. According to the modification indices, the intercepts of eight items were freed across groups: 4 ("works conscientiously, does the things he/she has to do well") (Eng < Sp); 9r ("can be a little careless and negligent") (Eng > Sp); 11 ("Is ingenious, reflects a lot") (Eng < Sp); 20r ("worries a lot about many things") (Eng > Sp); 22r ("is rather quiet, does not talk much") (Eng > Sp); 28r ("can be distant and cold with others") (Eng > Sp); 32r ("is timid, shy") (Eng > Sp); 36 ("likes to reflect, tries to understand complex things") (Eng < Sp). This model indicated a better fit than the model with the fully invariant intercepts and gave ΔCFI ≤ .01. Including additional constraints (MG4-MG7) gave ΔRMSEA ≤ .015 and ΔCFI ≤ .01, which suggests invariance across groups. Note that, owing to convergence problems, in the case of invariance between the English and Spanish versions the correlated uniquenesses invariance (MG4) was tested first, followed by the measurement errors invariance (MG5), rather than in the reverse order.
The results of the invariance analyses done across gender are also presented in Table 2 and indicated that this model was completely invariant (all the ΔCFI ≤ .01, and the ΔRMSEA ≤ .015) when specifying the constraints among factor loadings (MG2), intercepts (MG3), measurement errors (MG4), correlated uniquenesses (MG5), variances and covariances (MG6) and factor means (MG7) across groups of males and females.

Criterion-related validity
The correlations between personality domains and criterion variables in all three countries are presented in Table 6. The results demonstrated that health outcomes were related to low emotional stability, low agreeableness, low conscientiousness and low extraversion. Internalizing symptomatology (i.e., depression, anxiety and somatic distress) showed the closest associations with low emotional stability in all three countries. Some externalizing behaviors were related to low agreeableness and low conscientiousness in the three countries (e.g., antisocial behavior), and others only in the USA and Spain (e.g., alcohol, tobacco and illicit drug use). The correlations found between personality dimensions and the marijuana-related variables were low, but most of the significant associations were found with low conscientiousness. Finally, rumination in the three countries was related mainly to low emotional stability, and also to low conscientiousness, low agreeableness and low extraversion, but to a lesser extent. Happiness correlated mainly with emotional stability, followed by extraversion.
In order to determine whether personality dimensions were related differentially to distinct criterion variables across countries, the absolute value of the differences in the magnitude of the correlations for pairs of countries was computed and is presented in Table 7. As the statistical tests of these differences can be oversensitive to small differences given the differences in sample sizes across countries, attention was paid to the magnitude of these differences. The average difference in correlations was .070 (SD = .055) across 330 possible comparisons. The results were interpreted using the following criteria: differences <1 SD were small, differences between 1 SD and 2 SD were medium, those between 2 SD and 3 SD were large, and any over 3 SD were substantial. The results presented in Table 7 showed that large or medium size correlation differences across countries were found between conscientiousness and some health outcomes (i.e., anger, mania, anxiety, somatic distress, suicidal ideation and memory) (higher correlations in the US than in Argentina or Spain), and also between low agreeableness and low conscientiousness and drug outcomes (higher correlations in the US or Spain than in Argentina).
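The interpretation rule for correlation differences can be sketched as follows. The sketch assumes that the SD in the rule is the reported standard deviation of the absolute differences (.055) and that a difference of exactly 3 SD is still classed as large; the function name and the example correlations are hypothetical and are not taken from Tables 6 or 7.

```python
def classify_difference(r_a, r_b, sd=0.055):
    """Classify the absolute difference between two correlations in SD units."""
    units = abs(r_a - r_b) / sd
    if units < 1:
        return "small"
    if units < 2:
        return "medium"
    if units <= 3:
        return "large"
    return "substantial"


# Hypothetical pairs of correlations from two countries.
labels = [
    classify_difference(-0.30, -0.27),  # difference of .03
    classify_difference(0.25, 0.17),    # difference of .08
    classify_difference(-0.30, -0.18),  # difference of .12
    classify_difference(0.10, -0.10),   # difference of .20
]
```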

Discussion
The present study examined different sources of validity of the English and Spanish versions of the BFPTSQ [8,17] in college students from the US, Argentina and Spain. Specifically, we examined whether the BFPTSQ was invariant across two Spanish-speaking populations (Spain and Argentina), across languages (Spanish and English) and across gender. The criterion-related validity was examined via associations between the five BFPTSQ domains and a large set of psychological constructs (i.e., psychopathology, antisocial behavior, marijuana use and negative marijuana-related consequences, rumination and happiness) in the full sample as well as within each country.

Evidence for internal structure validity
Not surprisingly, given previous work that has examined complex structures such as the Big Five [8,17,39], the factor analysis results for the whole sample suggested that ESEM provided a better data fit than the ICM-CFA. Thus, the fact that ESEM allows for all possible factor loadings appears to better approximate the true model than the ICM-CFA. The present study included 125 statistically significant cross-loadings of the 250 possible factor loadings (i.e., 50%). The inclusion of cross-loadings affected the intercorrelations among the personality dimensions, as shown in Table 4. In the ICM-CFA, cross-loadings were set at 0, and the factor correlations were vastly inflated because this is the only way in which these cross-loadings can be represented [8,39]. The ESEM, however, not only provides factor correlations that probably come closer to the true population parameters, but also supports the discriminant validity among the Big Five traits as measured by the BFPTSQ [8].
Our findings also indicated an improved model fit when correlated uniquenesses were allowed. Although including correlated uniquenesses between items that were reverse-coded within the same factor or shared the same words is conceptually defensible and increased the model's fit, it also inevitably reduced the size of the factor loadings, as factors had less variance left to explain. This was salient for the openness factor, which allowed seven correlated uniquenesses and provided lower factor loadings. However, it is noteworthy that the ESEM model's fit, although acceptable when correlated uniquenesses were allowed, was still far from excellent according to the typical criteria suggested for practical fit indices [51]. Morin et al. [45] noted that the adequacy of these typical criteria has yet to be demonstrated with ESEM.
All the items presented significant factor loadings on their intended target factor, except items 31 ("is not really interested in different cultures, their customs and values") and 41 ("has few artistic interests") of the openness factor (in both the whole sample and the Spanish- and English-speaking samples). A previous study conducted with a general Spanish population sample reported primary factor loadings on the openness factor of .38 and .58 for items 31 and 41, respectively. The factor loadings for French-speaking [8] and Spanish adolescents [52] were also low. Taken together, and considering that the wording of the items is simple (which arguably implies fewer translation/adaptation problems), these findings suggest that these items may be less suitable for specific populations (i.e., adolescents and young adults) than for general or adult populations. The rewording, or even the elimination, of these items should be considered in future research, chiefly because the BFPTSQ was developed to supply a useful valid measure for longitudinally assessing personality dimensions across development (i.e., from adolescence to adulthood).
For the extraversion dimension, all the items showed salient factor loadings on their intended factor (i.e., > .30), except for item 42 ("likes exciting activities that provide thrills"). This sensation seeking-related item had a factor loading of .27 on extraversion and a factor loading of .38 on openness. Previous studies have indicated that sensation seeking tends to be openness-related [53,54].
All the emotional stability and conscientiousness items showed salient factor loadings on their intended factor, but some items in the agreeableness dimension also cross-loaded on the openness factor. As expected, the reverse-coded items that indicated antagonism or low agreeableness loaded on the agreeableness factor, while the positively worded items of this domain cross-loaded on the openness factor. Future revisions of the scale should consider these findings, and the fact that the use of positively worded items and reversed forms in the same scale (e.g., agreeableness) to reduce response bias has been questioned. Suárez-Alvarez et al. [55] illustrated this point and found that this common practice jeopardizes a measure's unidimensionality by adding secondary sources of variance, and also reduces its reliability.

Sources of structure validity across groups
In line with previous research, our findings supported the measurement invariance of the BFPTSQ across gender groups [8]. The present findings extend these results by suggesting that the BFPTSQ is also invariant across Spanish-speaking countries. This is a key milestone in cross-cultural research as comparisons between cultures/countries are not valid unless measurement invariance is met [56,57]. Of all the possible comparisons based on CFI changes (in intercepts, factor loadings, uniquenesses, factor variances/covariances, factor latent means and correlated uniquenesses among groups), only four differences were found in the intercepts of the Spanish and Argentinian students. Compared to the Spanish students, the Argentinian ones scored higher for three agreeableness items, which cover the facets of compliance (13r and 38r) and cooperation (43), and also for one conscientiousness item, which covers the facet of deliberation (49r).
Our measurement invariance results across languages (i.e., Spanish and English) revealed that all the agreeableness items loaded primarily on their intended factor in the Spanish-speaking sample. However, in the English-speaking sample, the positively-worded agreeableness items had similar factor loadings on the agreeableness and openness factors, as shown in Table 3. When the magnitude of these differences was tested empirically, only one difference was found in a primary factor loading: item 20r ("I see myself as someone who worries a lot about many things"). Although the factor loading of this item was higher in the English-speaking sample than in the Spanish-speaking one, it was salient and significant in both samples. This finding indicated that the item adequately represented its dimension in both groups.
The addition of constraints between intercepts led to only a few noninvariant intercepts across languages. Spanish speakers tended to obtain higher scores than English speakers for one conscientiousness item tapping the self-discipline facet (4) and for two openness items tapping the intellectual inquisitiveness facet (11 and 36). Compared to the Spanish-speaking participants, English speakers scored higher for: a) two extraversion items tapping the expressiveness (22r) and sociability (32r) facets; b) one agreeableness item tapping the compassion facet (28r); c) one conscientiousness item tapping the order facet (9r); and d) one emotional stability item tapping the worry facet (20r). Although a few noninvariant intercepts and loadings were observed across groups based on CFI differences, the RMSEA differences still suggested that the model was completely invariant across languages. The noninvariance of some intercepts and loadings was based mainly on the typical criteria proposed for changes in the fit indices. Nevertheless, these criteria are rough guidelines [42] and some researchers have questioned their validity [58,59], especially for ESEM [45]. Hence, rejecting the hypothesis of invariance for complex models with several items based simply on these typical criteria might not be constructive. Taken together, the results obtained herein suggest that it can be reasonably assumed that the BFPTSQ factor structure offers acceptable measurement invariance across languages.
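The following minimal Python sketch illustrates how a change-in-fit rule of this kind can produce diverging CFI and RMSEA verdicts for the same pair of nested models. The cutoffs used (a CFI drop of no more than .010 and an RMSEA increase of no more than .015) are conventional guidelines assumed here for illustration and are not necessarily the exact criteria applied in this study; the fit values and the names `Fit` and `compare_nested` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fit:
    """Practical fit indices for one fitted model."""
    cfi: float
    rmsea: float


# ASSUMPTION: conventional rule-of-thumb cutoffs, not values taken from this study.
MAX_CFI_DROP = 0.010
MAX_RMSEA_RISE = 0.015


def compare_nested(less_constrained: Fit, more_constrained: Fit) -> dict:
    """Compare a constrained model against its less constrained counterpart."""
    delta_cfi = more_constrained.cfi - less_constrained.cfi
    delta_rmsea = more_constrained.rmsea - less_constrained.rmsea
    return {
        "delta_cfi": round(delta_cfi, 3),
        "delta_rmsea": round(delta_rmsea, 3),
        # Invariance is retained when fit does not worsen beyond the cutoff.
        "cfi_supports_invariance": delta_cfi >= -MAX_CFI_DROP,
        "rmsea_supports_invariance": delta_rmsea <= MAX_RMSEA_RISE,
    }


# HYPOTHETICAL fit values, chosen only to show the two indices disagreeing.
configural = Fit(cfi=0.920, rmsea=0.040)
constrained = Fit(cfi=0.905, rmsea=0.041)

result = compare_nested(configural, constrained)
print(result)
```

With these illustrative numbers the CFI change exceeds its cutoff while the RMSEA change does not, which mirrors the kind of disagreement between indices described above and shows why such cutoffs are best read as rough guidelines.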

Criterion-related validity
The present study aimed to examine the associations of the BFPTSQ scores with a wide diversity of outcome variables, focusing particularly on substance use (or substance-related variables) and poor mental health outcomes. As already noted, these behaviors are highly prevalent in college students around the world, and a valid, yet brief, version will most likely facilitate both cross-cultural studies and routine interventions to detect students at high risk of developing substance use and/or mental health problems. As in previous studies, low emotional stability, low agreeableness, low conscientiousness and low extraversion were related to mental health outcomes [2,60]. Internalizing symptomatology (i.e., depression, anxiety and somatic distress) was related mainly to low emotional stability [2,60], while antisocial behavior was related to low agreeableness and low conscientiousness in all three countries [33,61].
Other externalizing behaviors, such as drug use, also showed significant correlations with low agreeableness and low conscientiousness, as in previous meta-analyses [2,61], at least in the USA and Spain. The association of disinhibition domains with drug outcomes was less consistent in Argentina and, consequently, some differences in the magnitude of the correlations arose across countries (i.e., medium-sized differences in the correlations of low conscientiousness and low agreeableness with alcohol use and tobacco use, respectively, in the USA and Spain compared to Argentina). The correlations previously reported between marijuana-related outcomes and conscientiousness/agreeableness [62–65] were only replicated clearly in the USA sample. When we calculated the absolute value of the correlation differences across countries, the only large difference found (i.e., one that was between 2 SD and 3 SD above the average difference in the magnitude of the correlations) was between conscientiousness and marijuana-related problems in the USA and Argentina. The lack of a significant negative association between marijuana-related problems and conscientiousness in the Argentinian sample was somewhat unexpected, as was the case for the associations of low conscientiousness and low agreeableness with alcohol, tobacco and illicit drug use. Future research should attempt to replicate this finding to determine whether the disinhibition-related domains assessed within the FFM framework influence drug outcomes in Argentinian college students.
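The cross-country comparison of correlation magnitudes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis script used in the study: all correlation values are hypothetical, the function name `label_differences` is invented for the example, the sample standard deviation is assumed, and the "medium" band (1 SD to 2 SD above the average difference) is an assumption, since only the "large" band (2 SD to 3 SD) is defined in the text.

```python
from statistics import mean, stdev

# HYPOTHETICAL correlations (trait, outcome, r in group A, r in group B).
# None of these values come from the study; they only illustrate the method.
ROWS = [
    ("emotional_stability", "depression", -0.45, -0.43),
    ("emotional_stability", "anxiety", -0.40, -0.37),
    ("agreeableness", "antisocial_behavior", -0.30, -0.25),
    ("conscientiousness", "antisocial_behavior", -0.22, -0.18),
    ("agreeableness", "alcohol_use", -0.15, -0.09),
    ("conscientiousness", "alcohol_use", -0.18, -0.15),
    ("extraversion", "happiness", 0.35, 0.30),
    ("emotional_stability", "happiness", 0.42, 0.34),
    ("agreeableness", "tobacco_use", -0.12, -0.02),
    ("conscientiousness", "marijuana_problems", -0.30, 0.02),
]


def label_differences(rows):
    """Label each absolute correlation difference by its distance from the mean difference."""
    diffs = [abs(r_a - r_b) for _, _, r_a, r_b in rows]
    avg = mean(diffs)
    sd = stdev(diffs)  # ASSUMPTION: sample SD; the text does not specify which SD was used
    labelled = []
    for (trait, outcome, _, _), diff in zip(rows, diffs):
        z = (diff - avg) / sd
        if z >= 3:
            size = "very large"  # ASSUMED label; this band is not defined in the text
        elif z >= 2:
            size = "large"  # between 2 SD and 3 SD above the average difference
        elif z >= 1:
            size = "medium"  # ASSUMED band: 1 SD to 2 SD above the average difference
        else:
            size = "unremarkable"
        labelled.append((trait, outcome, round(diff, 2), size))
    return labelled


labelled = label_differences(ROWS)
for row in labelled:
    print(row)
```

Because each difference is judged relative to the distribution of all differences, the labels depend on which correlations are included in the comparison set; the approach is descriptive rather than an inferential test of whether two correlations differ.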
In line with previous studies conducted using the BFPTSQ and other measures to assess the FFM, happiness was mainly related to both emotional stability (or low neuroticism) and extraversion [17,66,67], while rumination was related chiefly to conscientiousness, emotional stability, agreeableness and extraversion [37].

Limitations and conclusions
Our research is not without its limitations. First, even though the BFPTSQ's psychometric properties have been previously explored in a Spanish population [17], this is the first time that the structure of the English version has been tested. Based on some of our results (i.e., the nonsignificant factor loadings of items 31 and 41 on the openness factor, or the cross-loadings between the openness and agreeableness items), replication studies are needed before the questionnaire can be modified (i.e., by removing or substituting items). Second, our sample comprised university students, and future research should examine the reliability and construct validity of the English questionnaire version in both adolescent and general adult populations. Finally, our work explored evidence for criterion validity with a limited number of outcomes (i.e., psychopathology, antisocial behavior, marijuana-related outcomes, rumination, and happiness). Previous work has found an association between personality and a wide range of other health-related behaviors (e.g., work and educational outcomes [5,68]). Future research that employs the BFPTSQ could benefit from including more criterion variables.
Despite its limitations, the present research supports the BFPTSQ's factorial validity and the reasonable invariance of the measure across genders, across two Spanish-speaking populations, and between Spanish and English speakers. It also provides evidence of the scales' reliability and criterion validity (associations with distinct health outcomes). Taken together, our results suggest that the BFPTSQ is a useful short measure for assessing the FFM broad domains in English and Spanish speakers, at least for young adults from the USA, Argentina and Spain.

Table. Standardized Factor Loadings from the Exploratory Structural Equation Model of the BFPTSQ in Argentinian and Spanish Samples (M2b). The columns "Argentina" and "Spain" correspond to the factor loadings of the M2b model (see Table 2). Bold denotes all the significant factor loadings (the 99% CI does not cross zero). λ = factor loadings; δ = uniquenesses. (DOCX)