Evaluating the Psychometric Properties of the Maslach Burnout Inventory-Human Services Survey (MBI-HSS) among Italian Nurses: How Many Factors Must a Researcher Consider?

Background: The Maslach Burnout Inventory (MBI) is the mainstream measure of burnout. However, its psychometric properties have been questioned, and alternative measurement models of the inventory have been suggested. Aims: Different models for the number of items and factors of the MBI-HSS, the version of the inventory for the human services sector, were tested in order to identify the most appropriate model for measuring burnout in Italy. Methods: The study dataset consisted of a sample of 925 nurses. Ten alternative models of burnout were compared using confirmatory factor analysis. The psychometric properties of the items and the reliability of the MBI-HSS subscales were evaluated. Results: Item malfunctioning may confound the MBI-HSS factor structure. The analysis confirmed the factorial structure of the MBI-HSS with a three-dimensional, 20-item assessment. Conclusions: The factorial structure underlying the MBI-HSS follows Maslach's definition when the item set is reduced from the original 22 to 20. Alternative models, whether with fewer items or with more latent dimensions in the burnout structure, do not yield results good enough to justify redefining the item set or theoretically revising the syndrome construct.


Introduction
Occupational burnout is a psychological response to chronic work-related stress of an interpersonal and emotional nature that appears in professionals working directly with clients, patients, or other recipients. Maslach defined burnout in the 1970s as ''a syndrome of emotional exhaustion, depersonalization, and reduced personal accomplishment that can occur among individuals who do 'people work' of some kind'' ( [1], p. 3). This conceptualization led to the identification of the three main dimensions of burnout that are assessed in the Maslach Burnout Inventory-MBI [2], the worldwide leading instrument for the assessment of burnout, by means of three sub-scales: emotional exhaustion (EE), depersonalization (DP), and personal accomplishment (PA).
Various versions of the MBI exist. The first [3], intended for workers employed in health and social services, was later renamed MBI-Human Service Survey (MBI-HSS) to differentiate it from the one developed for educators, the MBI-Educators' Survey (MBI-ED) [1]. In the 1990s, research on burnout was extended to professionals other than those employed in human services: Schaufeli et al. [4] developed a third questionnaire, the MBI-General Survey (MBI-GS), to be used for general professionals.
Despite its popularity, the validity of the MBI, in all its versions, has been the subject of considerable debate [5,6,7], and many scholars have tested alternative models of the inventory, increasing or decreasing the number of factors or reducing the original 22-item set. The studies conducted between 2000 and 2014 on the factorial structure and psychometric properties of the MBI-HSS are systematically reviewed in Table 1 in terms of samples, data analysis, and results.
Although most studies have obtained a factorial solution similar to Maslach's, the faithful reproduction of the model is not associated with entirely satisfactory fit indices in any of them. Such results have led several scholars to highlight problematic aspects of the three sub-scales and to suggest that the original specification must be reconsidered [3,1]. To overcome the psychometric limits of the MBI-HSS, researchers [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24] have proposed different and not always compatible solutions that can be synthesized into four main procedures: allowing correlated error terms, allowing items to load on more than one factor, eliminating items, and increasing or decreasing the number of factors. The elimination of items is the most commonly adopted practice. Maslach [25] herself suggested eliminating items 12 (PA, ''I feel very energetic'') and 16 (EE, ''Working with people directly puts too much stress on me''). In their early studies, Maslach and colleagues found that item 12, intended to load on PA, measures EE, while item 16, intended to load on EE, tends to overlap with DP. Further studies [10,13,16,17,19,21] have confirmed the results obtained by Maslach et al. [25], highlighting how considerably fit indices improve when items 12 and 16 are deleted.
A second research corpus has detected other problematic items; for example, Poghosyan et al. [24] did so in a study involving eight samples from the United States (U.S.), Canada, the United Kingdom (U.K.), Germany, New Zealand, Japan, Russia, and Armenia. One of the most disputed issues concerns the role of PA in the syndrome. In several studies PA was weakly correlated with the other two dimensions, which, in contrast, showed quite high correlations with each other. This led Green et al. [26] to consider PA less crucial than EE and DP, which they identified as the ''core dimensions'' of burnout. More recently, Kalliath et al. [8] gathered empirical evidence for a bidimensional version composed of EE and DP, containing only seven items.
Faced with unsatisfactory results, other scholars have attempted to conceptually reformulate the construct, suggesting four- or even five-factor structures. Among these, Densten [12] proposed a five-factor model in which EE and PA were each divided, generating the components of psychological strain, somatic strain, a self-component (self-perceived professional competence), and an others-component (performance perceptions of others). Gil-Monte [13] suggested a four-factor solution in which, along with EE and DP, two other dimensions originating from PA were added: self-competence and an existential component linked to the interaction with patients. Similarly, Chao et al. [11] explored a four-factor structure dividing PA into indifference toward patients and rejection of the recipients.
In sum, these contrasting results on the MBI-HSS's psychometric properties warrant several considerations. It is well known that the MBI was developed in the North American context, and it is therefore probable that the absence of systematic results is caused by the linguistic and cultural heterogeneity of the samples on which the model was tested. Items may indeed assume different meanings depending on the context in which they are presented (according to Maslach et al. [27], North Americans may be more likely than Europeans to give ''extreme answers'' to items or to express cynicism), or because occasionally something is ''lost in translation'' [28].
Psychometric studies with Italian participants are rare. Sirigatti and Stefanile [29,30,31,32], who edited the Italian adaptation of the inventory in the 1990s, suggested that different models had to be tested in order to determine the most suitable factorial solution, but this proposal was never developed. Moreover, exploratory factor analyses conducted on samples that are heterogeneous with respect to Italian occupational sectors have highlighted that the structure proposed by Maslach, though reachable by imposing three factors, is not always the most adequate.
The aim of the present study is therefore to examine the factor structure and psychometric properties of the MBI-HSS and to gain insight into its functioning in Italy by: (a) evaluating the functioning, i.e., reliability and validity, of the MBI-HSS items in an Italian sample; and (b) testing the main alternative MBI-HSS models in order to identify the most appropriate model to measure burnout.
The study compares ten different models: Maslach's original model specification [1], the first relevant theoretical revision of the model proposed by Green et al. in the 1990s [26], and eight of those previously reviewed (identified as numbers 2, 3, 7, 8, 11, 13, 14, and 15 in Table 1). These eight models were considered for the present study because they A) avoid covariances between error terms, B) avoid cross-loading items, and C) imply the elimination of at most four items. Including covariances between error terms implies admitting problems in item phrasing, which can result in response bias, such as acquiescence or impression management [33,34,35], or in lexical redundancy in item wording, or item redundancy [36,37]. Specifying models with items cross-loading on multiple factors compromises factor integrity [38]. Moreover, in measuring a multidimensional construct, each factor's content coverage must be preserved: each deleted item causes a loss of content validity, and the more items are deleted, the more content coverage is compromised. An abbreviated scale can become a different, alternative assessment that no longer measures what it originally intended to measure [39]. Table 2 presents the ten models selected for comparison. Each model is identified by an alphanumeric label composed of the number of factors in the model (2-5) and a letter (A-E) distinguishing item sets when the number of latent dimensions remains stable but the set of considered items does not.

Materials and Methods
Data collection: participants, procedures, and instrument
All ethical guidelines required for conducting human research were followed, including adherence to the legal requirements of Italy. The research project was approved by the Hospital Boards of Directors of the five hospitals involved in the study: Cardinal Massaia (Asti); SS. Annunziata (Savigliano, Cuneo); San Giovanni Bosco, Gradenigo, and Maria Vittoria (Turin). Additional ethical approval was not required since there was no medical treatment, invasive diagnostics, or procedures causing participants psychological or social discomfort, nor were patients the subject of data collection. With the Hospital Boards' approval, department chiefs and nurse coordinators from each ward were asked to authorize administering the questionnaire to the nurses. Participants volunteered for the research without receiving any reward and were not asked to sign consent forms; returning the questionnaire implied consent. The cover sheet clearly explained the research aim, the voluntary nature of participation, the anonymity of the data, and the handling of the findings.
Participants represented a sufficiently large and heterogeneous sample of the nursing staff employed in Northwestern Italy. The sample consisted of 925 nurses, mainly women (66.6%), with a mean age of 37.9 years (SD 8.8), employed in the health sector for an average of 14.4 years (SD 9.7) and currently working in emergency (37.8%), medical (25%), surgical (13.2%), mental health (9%), diagnostic (6.8%), maternity and infant (3.0%), or ambulatory (4.3%) wards. Data were collected through a self-administered questionnaire including background indicators of participants' socio-demographic and professional attributes and the Italian version of the MBI-HSS [30], a 22-item assessment with a seven-point frequency rating scale ranging from 0 (''never'') to 6 (''every day'').

Statistical analysis
The psychometric properties of the Italian version of the MBI-HSS in a sample of nurses were preliminarily examined through an item analysis. Total and sub-scale reliabilities were assessed by means of Cronbach's α coefficients, while the contribution of single items to internal consistency was evaluated through item-total and item-subscale correlations.
In order to check the discriminatory capacity of each item, a procedure analogous to the one carried out by Lee et al. [22] was performed. Those authors suggest calculating the critical ratio, i.e., the t-value obtained when comparing the means of two groups: respondents below the 27th and above the 73rd percentile of the score distribution in the sample. In the present study, the percentile-based group definition is maintained, but each item's discrimination in relation to its theoretical sub-dimension is studied using ANOVA decomposition so that it can be expressed as an effect size.
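This effect-size variant of the discrimination check can be sketched as follows (an illustrative Python fragment under our own naming, not the study's code): respondents are split at the stated percentiles of their sub-scale total, and η² for each item is the between-group share of its score variance.

```python
def eta_squared(scores, groups):
    """Eta-squared: between-group sum of squares over total sum of squares."""
    grand = sum(scores) / len(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ss_between = 0.0
    for g in set(groups):
        vals = [x for x, lab in zip(scores, groups) if lab == g]
        mean_g = sum(vals) / len(vals)
        ss_between += len(vals) * (mean_g - grand) ** 2
    return ss_between / ss_total

def extreme_groups(totals, low=27, high=73):
    """Label respondents at or below the low-th percentile cut 'low' and at or
    above the high-th percentile cut 'high'; everyone in between gets None."""
    ordered = sorted(totals)
    lo_cut = ordered[int(len(ordered) * low / 100)]
    hi_cut = ordered[int(len(ordered) * high / 100)]
    return ["low" if t <= lo_cut else "high" if t >= hi_cut else None
            for t in totals]
```

An item whose scores separate the two extreme groups cleanly yields η² near 1; an item that barely discriminates, as items 16 and 15 do in the results below, yields η² near 0.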
Item evaluation also included consideration of response distributions, i.e., their normality and multivariate normality. In the sample, items do not show a multivariate normal distribution. Therefore, the PRELIS package was used to compute an asymptotic covariance matrix to correct the ML estimates obtained with the LISREL software, version 8.72 [40]. Since previous studies reached no agreement on factor relations, each model was estimated with both orthogonal and oblique factor specifications.
Model evaluation and comparison were conducted using both incremental and absolute fit indices: the comparative fit index (CFI) [41], the non-normed fit index (NNFI) [42], also known as the Tucker-Lewis index (TLI) [43], the root mean square error of approximation (RMSEA) [44], and the standardized root mean square residual (SRMR) [45,46]. Since these indices are widely used, their presentation facilitates comparison with previous results. Moreover, they are among the most sensitive in distinguishing good models from poor ones with misspecified factor loadings and/or factor covariances. Following Hu and Bentler [47], cutoff values of ≥0.95 for CFI, ≤0.06 for RMSEA, and ≤0.08 for SRMR constitute an efficient strategy to evaluate model fit. Furthermore, the consistent Akaike information criterion (CAIC) [48] and the expected cross-validation index (ECVI) [49] were used to compare non-nested models, and the Satorra-Bentler scaled difference (SB-Diff) to test differences between nested models [50,51].
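The joint cutoff rule can be encoded directly; the sketch below (our own helper, not part of any SEM package) simply compares a dictionary of fit indices against the Hu and Bentler thresholds stated above.

```python
# Hu & Bentler joint criteria: CFI >= .95, RMSEA <= .06, SRMR <= .08
CUTOFFS = {"CFI": (">=", 0.95), "RMSEA": ("<=", 0.06), "SRMR": ("<=", 0.08)}

def check_fit(indices):
    """Return {index_name: True/False} for every fit index supplied."""
    result = {}
    for name, value in indices.items():
        op, cut = CUTOFFS[name]
        result[name] = value >= cut if op == ">=" else value <= cut
    return result
```

For instance, a model with CFI = 0.94, RMSEA = 0.07, and SRMR = 0.08 would pass only the SRMR criterion under this rule.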

Reliability and item analysis
The reliability of the full scale, measured by Cronbach's α, is 0.800. The only item showing strong inhomogeneity with respect to the whole scale is item 12 (PA; Cronbach's α if the item is deleted = 0.822), with a negative item-total correlation (−0.130). Cronbach's α for the sub-scales is 0.896 for EE, 0.755 for DP, and 0.821 for PA. EE items show quite high item-total correlations; the lowest of the set is item 16 (0.524), whose deletion would not modify the internal homogeneity of the subset (Cronbach's α if the item is deleted = 0.894). PA and, above all, DP item-total correlations are weaker, but no deletion would leave the two sub-scales unaltered or increase their homogeneity.
All ANOVA tests performed on the sub-dimensions are significant, which means that all items in each set can discriminate between the relative score extremes. Nevertheless, the value of the η² coefficient associated with each item signals different item performance. For every sub-dimension, η² is the proportion of score variability accounted for by between-group differences after dividing participants into more and less exhausted, depersonalized, or accomplished groups (Table 3). We can interpret η² as an item's discriminatory power; even though significant, items 16 (EE) and 15 (DP) are the least capable of discriminating between participants.
Items do not show normal distributions. Tests of multivariate normality in the PRELIS application confirm the non-normality of the item distributions: the normality chi-square test strongly rejects the null hypothesis (χ² = 3639.7, p < 0.001).

Model comparison using CFA
Two baseline models were used to obtain an exhaustive view of model performance: M0, a null model with no covariances between items, and M1, a single-dimension model of burnout. Model fits are presented (Table 4) in order of factor cardinality, from M0 to M5 (five latent dimensions), and in order of item cardinality within each specification. Furthermore, each model occupies two rows, one for each factor specification: orthogonal or oblique.
The results show that, whatever the dimensionality specified, oblique solutions are better than orthogonal ones: factor covariance yielded an appreciable increase in fit, and the Satorra-Bentler scaled difference is always significant ( Table 5), suggesting that the relationship between the construct sub-dimensions cannot be ignored.
Based on the oblique specification results shown in Table 4, Maslach's model (M3) shows only somewhat satisfactory performance, as its fit indices do not reach the anticipated cutoffs: CFI is 0.94, RMSEA is 0.07, and SRMR is 0.08.
Complicating the model by adding a latent dimension to structure the original 22-item set (M4) does not yield substantial benefits. The Satorra-Bentler scaled difference is significant (SB-Diff between M3 and M4 = 42.93, df = 4, p < 0.001), but the fit indices remain stable.
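For reference, the scaled difference statistic used throughout this section follows the standard Satorra-Bentler correction for nested models; a minimal sketch (our own function name, assuming the usual inputs: each model's ML chi-square, degrees of freedom, and scaling correction factor) is:

```python
def sb_scaled_diff(t0, df0, c0, t1, df1, c1):
    """Satorra-Bentler scaled chi-square difference between nested models.
    Model 0 is the more restricted model (larger df); t = ML chi-square,
    df = degrees of freedom, c = scaling correction factor.
    Returns (scaled difference statistic, difference in df)."""
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)  # pooled scaling correction
    return (t0 - t1) / cd, df0 - df1
```

When both scaling corrections equal 1, the statistic reduces to the ordinary chi-square difference, which makes the behavior of the correction easy to verify.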
The three-dimensional models (M3A, M3C, and M3D), which preserve the original structure while deleting two items, do not show better features. The only solution that improves on Maslach's is the one proposed by Schaufeli et al. [15], yielding a CFI of 0.95 and 0.06 for both RMSEA and SRMR. The model is more parsimonious (CAIC = 1298.91) and shows better expected cross-validation (ECVI = 1.13). In this model (M3A), illustrated in Fig. 1, all parameters are associated with satisfactory estimates, and the correlations between subscales reproduce a well-known profile in Italy [52,53,54]: EE and DP are positively associated, and both have a weak negative correlation with PA. The loss of information caused by item elimination yields a measurement benefit only if the deleted items are 12 and 16, as pointed out by Schaufeli et al. [15] and often observed in various national contexts, including Italy [21]. The results of this model (M3A) confirm the indications of the item analysis conducted on the data reported in this study, showing the criticality of item 12 [18]. The deletion of four items (M3C) inevitably yields better parsimony (CAIC = 1145.72) but does not significantly increase fit: M3C's performance is similar to Maslach's model but worse than Schaufeli's. Gil-Monte's specification (M4A) [13], defined by splitting the same set of items selected by Schaufeli et al. [15] into a four-factor structure in which PA is divided into self-competence and existential components, increases fit in terms of chi-square (SB-Diff = 73.96, df = 3, p < 0.001) and CFI (0.96), but not RMSEA and SRMR. There is no conclusive evidence that M4A outperforms M3A and, as Gil-Monte himself concludes, it is preferable to maintain the three-dimensional model because the correlation between the two PA-derived factors is substantial (r > 0.80), indicating that they are aspects of the same dimension [13].
Densten's model (M5) [12], continuing to increase complexity, separates psychological strain from somatic strain, distinguishes between personal accomplishment related to self and to others, and eliminates three items (12, 13, and 14). This five-factor oblique model seems to interpret the between-item covariance well in terms of model fit (particularly considering CFI = 0.96 and SRMR = 0.05), but the results include a quasi-perfect correlation between the two PA factors (self and others) and a correlation of 0.87 between the two EE dimensions (psychological and somatic strain). Densten's model includes a theoretically interesting suggestion for EE, although obtained by forcing items. To reach the distinction between psychological and physical stress, one is compelled to eliminate two items (13 and 14) that, on the contrary, prove to be completely homogeneous and discriminating for EE. Moreover, the remaining items are reaggregated by forcing their meaning: although item 1 states ''I feel emotionally drained from my work'', it is attributed to somatic strain.
M3B has the best model fit (CFI = 0.96, RMSEA = 0.06, and SRMR = 0.06). Kim and Ji [10] further reduced the item set by eliminating items 12 and 16, as in Schaufeli's model (M3A), plus item 2, due to its high error-term covariance with item 1. Such covariance was not high in the present research dataset. Rather, item 2 yields the highest number of residuals between the observed covariances and those reproduced by the model: deleting this item from the set means eliminating observed covariances that could not be explained on the basis of the three estimated latent constructs. Examining error covariances between items in the Italian nurse sample, the highest refers to items 10 and 11 (both DP items), and there are non-negligible error covariances between items 5 and 6 (DP and EE items, respectively), 17 and 18, and 18 and 19 (both pairs on PA). In all cases except items 5 and 6, the item pairs in the Italian version contain lexical redundancy, are contiguous, and refer to the same construct; for the exception, the formulation of item 6 is entirely compatible with a double loading on EE and DP. It therefore seems that the order of item presentation, as well as lexical noise and redundancy [36,37], may contribute to the observed covariances, which cannot be explained by the substantive dimensions of the burnout construct.

Discussion and Conclusions
Analyses performed on the Italian sample showed that A) the factorial structure underlying the MBI-HSS is three-dimensional and follows Maslach's model, even if B) the item set is the one suggested by Kim and Ji [10], deleting items 2, 12, and 16, or, preferably, by Schaufeli et al. [15], deleting only items 12 and 16. The results confirm the original construct's dimensionality and the meaning of its sub-dimensions, even when the three-dimensional specification is compared to more complex specifications (in terms of number of latent dimensions) or simpler ones (in terms of number of items), which might appear more satisfying because increasing construct dimensionality or deleting items facilitates data reproduction and generally improves model fit. The alternative models of the MBI-HSS considered in this study do not yield results that justify a redefinition of the valid item set, an excessive shortening of the scale, or, still less, a theoretical redefinition of the syndrome.
In sum, a conclusive view of the inventory is that it actually measures three dimensions, and that every data-driven re-specification of the model amounts to an attempt to solve, a posteriori and by means of structural equation modeling, problems that concern not construct validity but item redundancy, lexical noise, desirability, or, no less crucially, the order of item administration.
Potential limitations of the current study are the non-probabilistic nature of the sample, which is not intended to be representative, and the fact that the study was not specifically designed to investigate the response styles or biases potentially associated with MBI-HSS items, which the observed item performance suggests may be at play. The present data did not permit controlling for, or estimating, the possible effects of item characteristics and order of presentation on participants' answers.
Considering the relevance and worldwide diffusion of the MBI, further studies should: A) focus on the order of item presentation so as to determine if and how this aspect affects the functioning of the items and the scale; B) identify and rephrase those items which show poor functioning across different linguistic and cultural contexts; and C) rephrase items of the Italian version that sound excessively similar or show meaning redundancy (i.e., the coupled items 6/16 and 10/11).