Validation of the Multicultural Personality Questionnaire Short Form (MPQ-SF) for use in the context of international education

The Multicultural Personality Questionnaire (MPQ) is one of the most widely used instruments for measuring individuals’ intercultural competences. The original version consists of 91 items, divided into five subscales, and has been shown to predict attitudes, behavior, and outcomes in a variety of intercultural contexts. Recently, a 40-item short form of the MPQ was developed (MPQ-SF), which may be particularly useful in settings in which time or survey space is limited, or where respondent drop-out is likely to occur. For example, the MPQ-SF would be a valuable tool for assessing longitudinal development of multicultural personality traits in training or educational settings. A prerequisite for such research is to establish measurement invariance of the MPQ-SF between different respondent groups, as well as across time points. Using a sample of students in an international university program (n = 519), the present study examines how the scales perform among male and female respondents, between students of Western and Non-Western background, and across two time points, five months apart. Based on our findings, we conclude that all five subscales of the MPQ-SF display sufficient measurement invariance to be reliably used in this and similar contexts, in comparative as well as longitudinal study designs.


Introduction
Due to globalization and increased international mobility, it has become much more common for people to interact with others with different cultural backgrounds. Studies have shown that individuals may react differently to cultural diversity in their social environment, leading some to be better able to communicate effectively across cultural boundaries and adjust more successfully to intercultural situations [1][2][3]. A significant predictor of these intercultural competences is personality [4][5][6]. Through meta-analysis [7], it was established that domain-specific personality traits, such as cultural empathy and cross-cultural self-efficacy, are better predictors of intercultural effectiveness than more general personality traits such as the Big Five [8]. Therefore, to predict how individuals behave in intercultural contexts, scholars have identified does not provide information on how the short scales perform among different groups of respondents. The first aim of the present study, therefore, is to further establish the validity of the MPQ-SF by examining between-group measurement invariance. Firstly, we will compare whether the subscales perform similarly between male and female respondents. Secondly, we will examine how they perform among individuals with different cultural backgrounds. This cross-cultural validation is particularly essential for future use of the instrument, since it is purposely designed for implementation in international and/or intercultural environments.

Longitudinal development of multicultural personality traits
One of the major challenges in the research area of intercultural competence is to examine how traits and skills develop over time. There is a growing interest in uncovering measurable effects of training and education on intercultural effectiveness [1], and, as an extension of this question, in whether exposure to intercultural interactions may contribute to the development of MPQ dimensions. The MPQ was originally designed as a personality measure [11], implying that its traits are relatively stable over time. However, it has become apparent that individuals can indeed train intercultural competences through intercultural learning and experiences [31,32], and may also increase their MPQ scores as a result [29].
A particularly interesting area of exploration is the development of multicultural personality among university students in international university programs. Many institutes of higher education around the world emphasize internationalization as one of their key features, which has resulted in a profound increase of the number of students completing (part of) their education abroad [33]. The long-term effects of these processes on students' multicultural personality remain poorly understood [34].
In order to tease out how MPQ dimensions change as a result of international education, scholars need to employ longitudinal study designs, and compare scores at different points in time. Because of the importance of reducing respondent drop-out, the MPQ-SF may be a particularly useful instrument for this type of research. However, to be able to use it as an instrument for tracking longitudinal development, it is important to first establish reliability of its subscales across time points. Earlier studies have reported on changes in average (long scale) MPQ scores between different measurements [18,28,29], but longitudinal measurement invariance has not been examined for either version of the instrument. As such, the second aim of the present study is to establish longitudinal validation of the subscales of the MPQ-SF in this particular context.

Sample and procedure
The study was approved by the Ethics Review Board of the Erasmus School of History, Communication, and Culture (ESHCC) at Erasmus University Rotterdam. Written informed consent was obtained from all respondents. The sample for this study consisted of two cohorts of first-year students of an international English-language bachelor program at a major research university in the Netherlands. Students were invited to complete the Multicultural Personality Questionnaire-Short Form [30] at two different time points, five months apart. The first invitation was sent after the students were notified of admission to the program, four months before the start of the academic year (T1-May). The second invitation was sent one month after the start of the program (T2-October). This approach allowed us to compare how the MPQ-SF performed in two different contexts, just before and after a major life event.
All questionnaires and related communication were in English. In order to apply for the program in question, non-native speakers of English were required to demonstrate proof of proficiency in English through a TOEFL, IELTS or Cambridge certificate. As such, we assume all participants were able to understand and complete our questionnaires. An informed consent form was included at the start of each questionnaire, asking participants to confirm they had read and understood the conditions of the study, and agreed to participate voluntarily. No compensation was given for participating.
The recruitment of respondents was conducted in two consecutive years (2017-2018), with identical procedures for each cohort. At T1, the digital invitation to complete the questionnaire was sent by e-mail to all students who had been admitted to the program. Across the two years, a total of 658 respondents were invited to participate in the first wave, of whom 456 (69.3%) responded. At T2, students who ultimately decided not to enroll in the program were removed from the mailing list. Across the two years, a total of 480 active and enrolled students were sent the invitation to participate in the second wave, and were given time in class to complete the questionnaire. Nearly all (453; 94.4%) responded. The final sample used in this study consisted of 519 respondents (60.5% female, 14.5% male, 25.0% other or unknown; M age = 19.1, range 18-35, SD = 1.58), who completed at least one of the two questionnaires. Of the final sample, 390 respondents (75.1%) participated at both T1 and T2, 66 (12.7%) participated only at T1, and 63 (12.1%) participated only at T2. Both cohorts were equally represented in the final sample (2017: n = 268; 51.6%, 2018: n = 251; 48.4%); no significant differences were found between cohorts on any of the study's variables.
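The sample composition can be checked directly from the reported counts. The sketch below is only an arithmetic illustration: the counts are those given in the text, while the variable names and the helper `pct` are our own.

```python
# Counts as reported in the text; variable names are illustrative only.
invited_t1, responded_t1 = 658, 456
invited_t2, responded_t2 = 480, 453
both_waves, only_t1, only_t2 = 390, 66, 63


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Everyone who completed at least one questionnaire.
final_n = both_waves + only_t1 + only_t2

# The wave totals should be recoverable from the participation pattern.
t1_total = both_waves + only_t1
t2_total = both_waves + only_t2

print("final sample:", final_n)
print("T1 response rate:", pct(responded_t1, invited_t1))
print("T2 response rate:", pct(responded_t2, invited_t2))
print("both waves:", pct(both_waves, final_n))
print("T1 only:", pct(only_t1, final_n))
print("T2 only:", pct(only_t2, final_n))
```

The three participation patterns add up to the final sample, and the two wave totals match the number of responses reported for T1 and T2.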

Measures
At both T1 and T2, respondents completed the 40-item version of the MPQ-SF [30], consisting of 8 items per dimension, which ask respondents to self-report whether certain personality characteristics are applicable to themselves, on a 7-point Likert Scale (1 = not at all applicable; 7 = completely applicable). Cultural Empathy (CE) was measured using 8 items, such as 'Pays attention to the emotions of others'; Emotional Stability (ES) was measured using 8 items, such as 'Keeps calm when things don't go well'; Flexibility (FX) was measured using 8 items, such as 'Looks for regularity in life' (Reversed); Openmindedness (OP) was measured using 8 items, such as 'Seeks people from different backgrounds'; Social Initiative (SI) was measured using 8 items, such as 'Is inclined to speak out'.
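To illustrate how a subscale with reversed items might be scored, the sketch below mirrors reversed responses on the 7-point scale before aggregating. Two points are assumptions on our part and are not stated in the text: that a subscale score is the mean of its 8 items, and which item positions are reversed (the positions used in the example are hypothetical).

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7   # 7-point Likert scale described in the text
ITEMS_PER_SUBSCALE = 8        # 8 items per dimension


def reverse_score(response):
    """Mirror a response on the 1-7 scale (1 <-> 7, 2 <-> 6, and so on)."""
    return SCALE_MIN + SCALE_MAX - response


def subscale_score(responses, reversed_positions=()):
    """Mean item score after reverse-scoring the flagged positions.

    responses: the 8 raw item responses of one respondent.
    reversed_positions: 0-based positions of reversed items (hypothetical
    here; the actual scoring key belongs to the instrument).
    Taking the mean rather than the sum is an assumption of this sketch.
    """
    if len(responses) != ITEMS_PER_SUBSCALE:
        raise ValueError("expected 8 item responses")
    if any(not SCALE_MIN <= r <= SCALE_MAX for r in responses):
        raise ValueError("responses must lie between 1 and 7")
    recoded = [
        reverse_score(r) if i in reversed_positions else r
        for i, r in enumerate(responses)
    ]
    return mean(recoded)


# Hypothetical respondent; positions 1 and 4 are treated as reversed items.
example = subscale_score([6, 2, 5, 7, 3, 6, 4, 5], reversed_positions={1, 4})
print(example)
```

In the example, the two reversed responses (2 and 3) are recoded to 6 and 5 before averaging.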
Respondents were also asked to voluntarily indicate their age (in years), their gender (female / male / other), and their country of birth (open question). The data on country of birth was used as a proxy for the respondents' cultural background. In total, 74 countries were represented in the sample.
Ideally, a full cross-cultural validation would involve examining invariance between a number of cultural groups, each with sufficient sample size to reliably fit the different models under scrutiny [35]. However, due to limited sample sizes of specific national groups in our sample, the present study makes use of only two aggregated groups, consisting of students of Western and Non-Western background. The Western group included those who were born in Europe, North America, Australia, or New Zealand (n = 283, 54.5%), of which those from the Netherlands (29.1%), Germany (4.6%), France (1.7%), Italy (1.3%) and the United Kingdom (1.2%) were the largest subgroups. The Non-Western group included respondents born elsewhere (n = 105; 20.2%), of which those from Vietnam (5.3%), China (2.0%), India (1.7%) and Indonesia (1.0%) were the largest subgroups. For 131 respondents (25.2%), country of birth was unknown.
In Table 1, we provide an overview of the reliability and descriptive statistics of all the MPQ-SF subscales at both time points, as well as their correlations with each other and with control variables.

Analyses
In line with established procedures [36][37][38], we examined between-group measurement invariance by testing the factor structure across gender (male vs. female) and cultural background (Western vs. Non-Western), through three separate models. First, we fitted a configural invariance model (same number of factors between groups). Next, we fitted a weak invariance model, in which the factor loadings were constrained to be equal between groups. Finally, we fitted a strong invariance model in which, next to the loadings, the intercepts of the items were also constrained to be equal between groups.
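The three nested models can be summarized in conventional single-factor notation. The symbols below are our own shorthand and do not appear in the original analysis: x is the response to item i in group g, tau the item intercept, lambda the factor loading, eta the latent factor, and epsilon the residual.

```latex
% One-factor measurement model for item i in group g (notation assumed)
\begin{align*}
  x_{ig} &= \tau_{ig} + \lambda_{ig}\,\eta_{g} + \varepsilon_{ig}
  && \text{configural: same structure, all parameters free per group} \\
  \lambda_{ig} &= \lambda_{i} \quad \text{for all } g
  && \text{weak: equal loadings} \\
  \lambda_{ig} &= \lambda_{i}, \quad \tau_{ig} = \tau_{i} \quad \text{for all } g
  && \text{strong: equal loadings and intercepts}
\end{align*}
```

Each model adds constraints to the previous one, which is why their fit can be compared step by step.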
To test the successive 'degrees' of measurement invariance, we compared the fit of the weak invariance model to that of the configural invariance model, and the fit of the strong invariance model to that of the weak invariance model. We also tested if the configural invariance model had sufficient fit to the data. Significant differences in Chi-square, changes in Comparative Fit Index (CFI) larger than .01, changes in Root Mean Square Error of Approximation (RMSEA) larger than .015, and changes in Standardized Root Mean Square Residual (SRMR) larger than .030 were taken as indication of significantly worse fit between the models [39]. If the weak invariance model fit worse than the configural model, individual loadings were freely estimated between groups, and the number of freely estimated loadings was increased (with additional free loadings added in order of the loadings with the largest difference between groups), until the (now partial) weak invariance model did not fit worse than the configural model. Similarly, if the strong invariance model fitted worse than the weak or partial weak invariance model, individual intercepts were freely estimated across groups (with free intercepts added in order of the intercepts with largest difference between groups) until the (partial) strong invariance model did not fit worse than the weak or partial weak invariance model. This procedure was followed for each of the five factors of the MPQ-SF, for T1 and T2 separately. Similarly, we tested longitudinal measurement invariance by successively testing configural, weak, and strong invariance, but now across timepoints, freeing individual loadings and/or intercepts if necessary. When testing longitudinal invariance, we allowed the residuals of the indicators to covary between timepoints in all models.
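The comparison rule and the order in which parameters are freed can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual implementation: the class `Fit`, the function names, and all numeric fit values are hypothetical, and treating a change in any single index as sufficient evidence of worse fit is our reading of the rule. The chi-square difference test is left out of the function because it requires the fitted models themselves.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    """Fit indices of one fitted model (values in the example are made up)."""
    cfi: float
    rmsea: float
    srmr: float


# Cut-offs quoted in the text.
MAX_CFI_DROP = 0.01
MAX_RMSEA_RISE = 0.015
MAX_SRMR_RISE = 0.030


def fits_worse(restricted, baseline):
    """True if the more constrained model fits worse than the baseline.

    Assumption of this sketch: exceeding any one cut-off counts as worse fit.
    """
    return (
        baseline.cfi - restricted.cfi > MAX_CFI_DROP
        or restricted.rmsea - baseline.rmsea > MAX_RMSEA_RISE
        or restricted.srmr - baseline.srmr > MAX_SRMR_RISE
    )


def freeing_order(group_a, group_b):
    """Item numbers ordered by largest absolute between-group difference.

    group_a, group_b: dicts mapping item number to a loading or intercept.
    """
    diffs = {item: abs(group_a[item] - group_b[item]) for item in group_a}
    return sorted(diffs, key=diffs.get, reverse=True)


# Hypothetical fit values for three nested models.
configural = Fit(cfi=0.960, rmsea=0.050, srmr=0.040)
weak = Fit(cfi=0.955, rmsea=0.052, srmr=0.045)
strong = Fit(cfi=0.930, rmsea=0.060, srmr=0.050)

weak_ok = not fits_worse(weak, configural)
strong_ok = not fits_worse(strong, weak)

# Hypothetical loadings per group; item 2 differs most and is freed first.
order = freeing_order({1: 0.70, 2: 0.55, 3: 0.80}, {1: 0.68, 2: 0.75, 3: 0.70})
print(weak_ok, strong_ok, order)
```

In this made-up example the weak model holds, the strong model does not, and the partial strong model would be built by freeing parameters in the listed order, refitting and re-checking with `fits_worse` after each step.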

Measurement invariance for gender
As shown in Table 2, for Cultural Empathy (CE), a one-factor model had good fit at T1 and moderate to good fit at T2. At T1 the CE model had partial strong invariance for Gender (with only the intercepts of items 2 and 5 needing to be estimated separately for males and females). At T2 the CE model again showed partial strong invariance for Gender (with only the loading for item 4 needing to be varied across gender).
For the Emotional Stability (ES) factor, items 2 and 8 and items 4 and 6 were allowed to have residual covariances in order to get sufficient model fit. With these additions the configural model had moderate to good fit at both T1 and T2. At T1 the ES model had partial strong invariance for Gender (with only the intercepts of items 4 and 8 needing to be estimated separately for males and females). At T2 the ES model showed strong invariance for Gender.
For the Flexibility (FX) factor, items 2 and 3 and items 4 and 5 were allowed to have residual covariances in order to get sufficient model fit. With these additions the factor model had moderate to good fit at T1 and weak to moderate fit at T2. At T1 the FX model had partial strong invariance for Gender (with only the intercepts for items 1 and 5 estimated freely across groups). At T2 the FX model again showed partial strong invariance for Gender (with the loading for item 4, and the intercepts for items 2 and 3 needing to be varied across groups).
For the Openmindedness (OP) factor, the model did not fit the data well at T2, but since the longitudinal configural model had good fit (see Table 4 and discussion below), we decided not to add any residual covariances. At T1 the OP model had partial strong invariance for Gender (with only the intercept for item 7 estimated freely across Gender). At T2, the OP model had strong invariance for Gender.
For the Social Initiative (SI) factor model, the residuals of items 3 and 7 needed to be correlated to get weak to moderate fit at both T1 and T2. At T1 the SI model had strong invariance for Gender. At T2 the SI model showed partial strong invariance (with the loadings for items 2, 4, and 5 needing separate estimation across genders).
In sum, for the subscales of the MPQ-SF, we found only a small number of parameters that differed between males and females. Due to the fact that these differences were not constant across time (different parameters needed to be freely estimated at T1 compared to T2), and the number of parameters that needed to be freely estimated was small, we conclude that for all factors, the models could meaningfully be fitted to the sample as a whole. Our findings thus show that the MPQ-SF performs similarly among both males and females.

Measurement invariance for cultural groups
The same procedure was followed in comparing how the MPQ-SF factors performed among Western vs. Non-Western cultural groups. As shown in Table 3, CE displayed strong invariance for cultural group at T1, and partial strong invariance at T2 (with only the loading for item 7 needing to be freely estimated).
At both T1 and T2, the ES model had strong invariance for cultural group.
At T1 the FX model had partial strong invariance for cultural group (with only the intercept for item 6 estimated freely across groups). At T2 the FX model also showed partial strong invariance for cultural group (with only the loading for item 5 estimated separately for the Western and Non-Western group).
At both T1 and T2, the OP model had strong invariance for cultural group.
At T1 the SI model had partial strong invariance for cultural group (with the loading for item 4, and the intercept for item 8 needing separate estimation across groups). At T2 the SI model showed partial strong invariance for cultural group (with only the loading for item 2 estimated separately for the Western and Non-Western group).
In sum, for the subscales of the MPQ-SF, we found only a small number of parameters that differed between the Western and Non-Western groups. As before, due to the fact that these differences were not constant across time, and only a small number of parameters needed to be freely estimated, we conclude that for all factors, the models could meaningfully be fitted to the sample as a whole. Our findings show that the MPQ-SF performs similarly among both Western and Non-Western respondents.

Longitudinal measurement invariance
To examine longitudinal measurement invariance, we tested the performance of the MPQ-SF factors across the two time points. As was explained above, 75.1% (n = 390) of the sample participated in the study at both T1 and T2. We first examined invariance using only this subsample. Next, the same models were tested using Full Information Maximum Likelihood (FIML) estimation on the full sample (n = 519), thus including participants with missing data on one of the time points. Results were comparable across both sets of analyses. Table 4 displays the results for our examination of measurement invariance across time points using FIML on the full sample. As is customary in testing for longitudinal measurement invariance, we allowed the residuals of the indicators to covary between timepoints in all models.
For CE, the configural invariance model had moderate to good fit to the data. While the difference in chi-square was significant between the weak and configural model, and between the strong and the weak model for CE, the differences in CFI, RMSEA, and SRMR were smaller than the boundaries specified above for worse model fit. We therefore concluded that the model for CE displayed strong measurement invariance across time. For the ES factor, the configural model had moderate to good fit to the data. The weak invariance model did not fit significantly worse than the configural model, but the difference in fit between the strong and weak invariance model was too large to conclude strong invariance across time. After allowing the intercept of item 7 to vary across timepoints, the difference with the weak invariance model was no longer significant. We therefore concluded that the ES model had partial strong invariance across time, with only the intercept of item 7 differing between timepoints.
For the FX factor, the configural model had moderate to good fit to the data when allowing for the two residual correlations described above (between items 2 and 3, and between items 4 and 5). The weak invariance model did not fit significantly worse than the configural model, but the difference in fit between the strong and weak invariance model was again too large to conclude strong invariance across time. After allowing the intercepts of items 2 and 7 to vary across timepoints, we found the difference between the weak invariance and partial strong invariance model to be small enough to conclude partial strong invariance. We therefore concluded that the FX model had partial strong invariance across time, with only the intercepts of items 2 and 7 differing between timepoints.
For the OP factor, because the longitudinal configural model had good fit we decided not to add any residual covariances. The weak model did not fit worse than the configural model and the strong invariance model did not fit worse than the weak invariance model, implying that the model for OP showed strong measurement invariance across time.
For the SI factor, the configural invariance model displayed moderate fit. Because we wanted to stay as close to the theoretical models as possible, we decided not to add any more residual correlations and to use the moderately fitting longitudinal configural model as our starting point. Following the rules specified above, the weak invariance model did not fit worse than the configural model, and the strong invariance model did not fit worse than the weak invariance model. Since the strong invariance model also had moderate fit to the data (like the configural model), we concluded that the model for SI showed strong invariance across time.
Overall, our results show that the subscales of the MPQ-SF display sufficient measurement invariance across time points to be reliably used in longitudinal research.

Summary of findings and implications
The question of how intercultural competences, and specifically multicultural personality traits, develop in international training and education has become a major area of interest in intercultural research [1]. Understanding the development of such complex constructs over time requires the use of reliable, yet practical measures, of which not many are available. A promising instrument for measuring multicultural personality development is the 40-item short form of the Multicultural Personality Questionnaire (MPQ-SF), which has previously been established as a reliable alternative to the long version of the instrument [12,30]. However, further validation of the MPQ-SF was needed among different groups of respondents. Although the long version of the MPQ has already been shown to be reliable in different samples [12,40], this is the first study to examine measurement invariance between different groups of respondents for the short form.
Our results show that all five subscales perform similarly among both male and female students, as well as among Western and Non-Western students. Based on our findings, we can conclude that the MPQ-SF can be reliably used to measure and compare multicultural personality within and across these groups. The fact that the scales appear to be reliable among respondents with different cultural backgrounds is particularly important, because the instrument is specifically intended to be used in intercultural contexts. The second aim of this study was to examine longitudinal measurement invariance, by comparing how the MPQ-SF performs at two different time points, five months apart. Again, our results show sufficient invariance between these time points, and we conclude that the MPQ-SF can be reliably used in longitudinal research in the context of international education.
This opens up many new opportunities for further exploration, allowing scholars to track changes in respondents' multicultural personality over time, and examine factors which may influence how individuals' scores on the five subscales develop as a result of training or studying in an intercultural university program. Particularly in the context of international education, such studies may yield new insights into ways of increasing the intercultural competences of students and staff. For example, such insights could be used to enhance the effects of going on an international exchange, to maximize the effects of cultural diversity in the students' social/educational environment, or to increase interaction between local and international students.
At this point, a growing research tradition based on the MPQ and MPQ-SF will allow us to get a deeper understanding of these issues, which is crucial in informing institutional management on how to design new policies towards internationalization of higher education, and how to aid educators in designing a more inclusive multicultural classroom.

Limitations
As with any research, the present study has several limitations. The most notable is that, as with the original study that was used to develop the MPQ-SF, the present research was conducted using a sample of university students. Although this is an interesting and relevant respondent group, particularly considering the possible uses of the MPQ-SF in the context of higher education, it does limit generalizability. The original MPQ has already been validated among different groups of respondents, such as expatriates [24], migrants [14], participants in intercultural training [15] and job applicants [25]. We recommend that future scholars who use the MPQ-SF in such contexts also test for measurement invariance in their own sample.
The second limitation of the present study is that our cross-cultural validation relies on comparison between two respondent groups only. A full cross-cultural validation examines cultural bias in item responses, between a larger number of cultural groups, each with sufficient sample size to reliably fit the different models under scrutiny [35]. However, due to limited sample sizes of specific national groups in our sample, the present study makes use of only two aggregated groups, consisting of Western and Non-Western respondents. As such, our findings provide sufficient support for reliable use of the MPQ-SF in a diverse intercultural context, such as higher education. However, for directly comparing levels of multicultural personality between specific cultural groups, a more detailed cross-cultural validation should be conducted first.
Finally, a word of caution for those who are interested in applying the MPQ in a professional context, for example in diagnostics or assessment. The use of short form measures to assess personality traits has a number of limitations that one should be aware of [41]. Although many of the possible deficiencies of the short-form have been mitigated through this study and earlier work [30], the full version of the MPQ remains the instrument of choice for such settings.

Conclusion
The MPQ-SF is shown to be a useful and reliable instrument for measuring multicultural personality in research contexts where time or survey space is limited, or where drop-out of respondents is more likely to occur. Through a study conducted in an international university context, we have established that all five subscales perform similarly among men and women, as well as among Western and Non-Western respondents, and that the instrument is sufficiently invariant across time points. This allows the instrument to be used in studying the longitudinal effects of international education on multicultural personality.