Improved measurement of tinnitus severity: Study of the dimensionality and reliability of the Tinnitus Handicap Inventory

Objective The Tinnitus Handicap Inventory (THI) is widely used in clinical practice and research as a three-dimensional measure of tinnitus severity. Despite extensive use, its factor structure remains unclear. Furthermore, THI can be considered a reliable measure only if Cronbach’s alpha coefficient and Classical Test Theory are used. The more modern and robust Item Response Theory (IRT) has so far not been used to psychometrically evaluate THI. In theory, IRT allows a more precise evaluation of THI’s factor structure, reliability, and the quality of individual items. Method There were 1115 patients with tinnitus (556 women and 559 men), aged 19–84 years (M = 51.55; SD = 13.28). The dimensionality of THI was evaluated using several models of Confirmatory Factor Analysis and an Item Response Theory approach. Exploratory non-parametric Mokken scaling was applied to determine a unidimensional and robust scale. Several IRT polytomous models were used to assess the overall quality of THI. Results The bifactor model had the best fit (RMSEA = 0.055; CFI = 0.976; SRMR = 0.040) and revealed one strong general factor and several weak specific factors. Mokken scaling generated a reliable unidimensional scale (Loevinger’s H = 0.463). In order to refine THI we propose that five items be removed. The IRT Generalized Partial Credit Model generated good parameters in terms of item location (difficulty), discrimination, and information content of items. Conclusion Our findings support the use of THI to evaluate tinnitus severity as a reliable unidimensional scale. However, clinicians and researchers should rely only on its overall score, which reflects global tinnitus severity. To improve its psychometric quality, several refinements of THI are proposed.


Introduction
In recent years there has been increasing interest among clinicians and healthcare providers in assessing patients' health status using Patient Reported Outcome Measures (PROMs). A PROM instrument is any report of the status of a patient's health that originates directly from the patient [1]. PROMs have been defined as health questionnaires which evaluate aspects of a patient's health from the patient's perspective [2].
PROMs are useful in clinical practice for diagnosis, choice of treatment, and monitoring changes. There is evidence that the use of PROMs improves patients' satisfaction, allows monitoring of response to treatment, and detects unrecognized problems [3]. In clinical trials they serve as primary or secondary endpoints [4], and they are used in health systems and in health policymaking for assessing and improving quality of care [5,6]. The scope of PROMs' application is still expanding [7], and efforts have been made recently to ensure that the methodology of PROM use is clinically meaningful, valid, and reliable [8][9][10][11]; only then can they serve as effective instruments in enhancing healthcare quality.
PROMs are particularly useful in assessing subjective disorders which are not apparent to others and which are registered only through the complaints of the sufferers; tinnitus is one such disorder. Tinnitus is the subjective perception of sound without any external acoustic stimulation, and is perceived as ringing in the ears, hissing, chirping, buzzing, or other sounds [12][13][14]. Its prevalence ranges from 4% to 15% in adults [15], and 6% to 34% in children [16][17][18]. Tinnitus is accompanied by a broad range of negative emotional symptoms, and significantly impacts on quality of life [19,20]. Because of the limited effectiveness of audiological assessment and psychoacoustic measurement, self-reported rating scales and questionnaires are widely used in evaluating the severity of individually perceived tinnitus [21][22][23], where severity is defined as the level of distress or impact that tinnitus has on the person [24]. There is no option for measuring tinnitus severity other than self-report measures (primarily multi-item questionnaires), which need to have acceptable psychometric quality.
There are many questionnaires used for assessing tinnitus severity [23]. The Tinnitus Handicap Inventory (THI) stands out among them: it is the most commonly used tool and has been validated in the largest number of languages [25]. THI was created to evaluate the impact of tinnitus on daily living [26], and is used as a screening tool for psychiatric disorders [27], and as an outcome measure for evaluating treatment effects in clinical trials [21][28][29][30]. There is a brief, time-efficient screening version which consists of only 10 items, and this has greatly increased the use of THI [31].
Although THI is a widespread tool, its factor structure remains unclear. Newman and colleagues originally postulated three factors (the Emotional, Functional, and Catastrophic subscales), but these were based on item content, not on factor analysis [26]. Factor analysis for THI was first reported for a Danish version of THI, but the study sample comprised only 50 tinnitus patients [32]. Exploratory factor analysis did not confirm a three-factor solution, indicating that only the THI total score should be used as a valid measure of tinnitus severity (not the scores on the three subscales). In 2003, Baguley and Andersson conducted exploratory factor analysis of THI in a group of 196 patients, and the analysis gave strong support for a unifactorial structure [33]. To date, more than a dozen factor structure validation studies of THI have been published, with study groups ranging from 50 [32,34] to 373 patients [35]. The majority of these studies failed to demonstrate a three-factor structure [36][37][38], although two of them did support the original three-factor solution [35,39]. In particular, the German study seems very strong: in its confirmatory factor analysis it used a large sample of 373 tinnitus patients [35]. The findings showed that a three-factor model gave a better fit than a unidimensional model, and indicated that the three subscales of THI (Functional, Emotional, Catastrophic) were each valid and provided three distinct dimensions of tinnitus severity.
It is worth noting that work so far has used only a Classical Test Theory (CTT) approach, whereas a more modern and robust approach is now available: Item Response Theory (IRT). In this context, the factor structure of THI is not just an academic exercise but an important problem in clinical practice. It is crucial for a clinician or researcher to know which factor structure (three-dimensional or unidimensional) is appropriate to the situation, and to be confident about whether they can rely on each subscale score or only on the total THI score.
The second issue which is critical to psychometric quality is reliability. The most popular index of reliability is Cronbach's alpha coefficient, which is based on CTT [40]. All studies concerning psychometric properties of THI report alpha for both total scale and for subscales.
Reliability across studies appears to be very high, mostly above 0.90. Across almost all studies, alpha for the Functional and Emotional subscales ranged from 0.8 to 0.9, while for the Catastrophic subscale it was lower, about 0.6-0.7. However, Cronbach's alpha coefficient has numerous limitations [41][42][43], and other more robust model-based indices of reliability have recently been proposed. Reliability estimates within CTT have some limitations: they depend on the particular sample, and measurement error is assumed to be the same across all levels of the latent trait. IRT overcomes these limitations by treating reliability as precision of measurement independent of the particular sample and by enabling estimation of measurement error at any given level of the latent trait.
The present study has three goals:
1. To examine the theoretical structure of tinnitus severity as measured by THI. Our starting hypothesis is that a unidimensional model best accounts for the structure of a measured construct.
2. To determine the reliability of THI in a model-based approach which has so far not been used in psychometric studies of THI.
3. To give guidance for a potential refinement of THI using Item Response Theory.

Design
Our retrospective study used data from patients admitted to a tertiary referral ENT center in Poland over the period July 2015 to September 2018. Patients had reported problems with tinnitus as a primary complaint or secondary to hearing loss, and filling in THI was part of the standard diagnostic evaluation. Records of patients were retrospectively screened to check compliance with the eligibility criteria: age above 18 years, duration of tinnitus at least 1 month, documented hearing thresholds based on clinical pure tone audiometry, and a completed Tinnitus Handicap Inventory. The Institutional Review Board approved the protocol of the study (approval no. KB IFPS 18/2018). Due to the retrospective nature of our evaluation, no written consent was gathered from the participants.

Measures
The Tinnitus Handicap Inventory (THI) comprises 25 items grouped into three subscales: Functional, Emotional, and Catastrophic. The Functional subscale (11 items) deals with limitations caused by tinnitus in the areas of mental, social, and physical functioning. The Emotional subscale (9 items) concerns affective responses to tinnitus, e.g. anger, frustration, depression, anxiety. The Catastrophic subscale (5 items) probes the most severe reactions to tinnitus, such as loss of control, inability to escape from tinnitus, and fear of having a terrible disease. For each item a patient can respond with a "yes" (scored 4 points), "sometimes" (2 points), or "no" (0 points). The responses are summed within each subscale and for the total scale. The higher the score, the greater the perceived tinnitus severity [26]. The Polish version of THI validated by Skarzynski et al. [38] was used in this study.
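The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not code used in the study: the function names `thi_total` and `thi_partial` and the example patient are hypothetical, and the assignment of item numbers to subscales is deliberately left to the caller because it is not listed in this section.

```python
# Response options and point values as described in the Measures section.
POINTS = {"yes": 4, "sometimes": 2, "no": 0}
N_ITEMS = 25


def thi_total(responses):
    """Sum the points over all 25 THI items (possible range 0-100)."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    return sum(POINTS[r] for r in responses)


def thi_partial(responses, item_numbers):
    """Sum the points over a caller-supplied set of 1-based item numbers."""
    return sum(POINTS[responses[i - 1]] for i in item_numbers)


# Hypothetical patient: "yes" to items 1-5, "sometimes" to 6-10, "no" to the rest.
example = ["yes"] * 5 + ["sometimes"] * 5 + ["no"] * 15
total = thi_total(example)  # 5 * 4 + 5 * 2 points
```

A subscale score is obtained by passing that subscale's item numbers to `thi_partial`.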

Participants
There were 1115 individuals (556 women and 559 men); their mean age was 51.6 years (SD = 13.3) and ranged from 19 to 84 years. The period of suffering from tinnitus varied from 1 month to 50 years (M = 6.6; SD = 7.7). Most frequently, the tinnitus was bilateral (57%), while 26% of the patients reported tinnitus in the left ear and 17% in the right.

Data analysis
The first step was to evaluate the dimensionality of THI, and here four CFA models were used: a unidimensional CFA, a second-order CFA, a bifactor CFA, and a three-dimensional CFA model with correlated factors. Weighted least squares estimation with mean- and variance-adjusted chi-square statistics (WLSMV) and Theta and Delta parameterization were applied. Taking into account that the THI items are ordinal categorical variables, polychoric correlation coefficients were used. The overall fit of a CFA model was considered adequate if its Root Mean Square Error of Approximation (RMSEA) was < 0.05, the Comparative Fit Index (CFI) was > 0.95, and the Standardised Root Mean Square Residual (SRMR) was < 0.05 [44]. Model-based reliability was assessed by McDonald's omega, the H-index, and the average variance extracted [45]. McDonald's omega was calculated as both omega total (ω) and omega hierarchical (ωH), and for the bifactor model omega hierarchical of the subscales (ωHS) and Percentage of Reliable Variance (PRV) were also calculated [46]. An omega value above 0.80 was considered high [47]. Omega hierarchical above 0.75, in conjunction with a PRV above 75%, indicates a scale's unidimensionality. Omega hierarchical subscale reflects the reliability of a subscale after controlling for the variance due to the general factor [48]. Average Variance Extracted (AVE) is the amount of variance captured by a construct relative to the amount of variance due to measurement error. Fornell and Larcker stated it should be at least 0.5 [49]. The H-index is a measure of maximal reliability for an optimally-weighted scale, i.e. when each item contributes different information to the global score [50,51]. The H-value was expected to have a minimum of 0.7.
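To make the model-based reliability indices concrete, the sketch below computes omega total, omega hierarchical, AVE, and the H-index from standardized factor loadings. It is an illustration under stated assumptions, not the study's Mplus workflow: it uses the usual textbook formulas for standardized loadings with orthogonal factors, the function names are hypothetical, and the loadings in the example are invented rather than taken from Table 3.

```python
import numpy as np


def omega_total(loadings):
    """Omega for a one-factor model with standardized loadings."""
    lam = np.asarray(loadings, dtype=float)
    common = lam.sum() ** 2             # variance of the sum due to the factor
    unique = (1.0 - lam ** 2).sum()     # summed residual variances
    return common / (common + unique)


def omega_hierarchical(general, specific):
    """Share of total-score variance due to the general factor (bifactor model).

    general:  shape (n_items,), loadings on the general factor
    specific: shape (n_items, n_groups), zero where an item does not load
    """
    g = np.asarray(general, dtype=float)
    s = np.asarray(specific, dtype=float)
    unique = 1.0 - g ** 2 - (s ** 2).sum(axis=1)
    var_general = g.sum() ** 2
    var_specific = (s.sum(axis=0) ** 2).sum()
    return var_general / (var_general + var_specific + unique.sum())


def ave(loadings):
    """Average variance extracted: mean squared standardized loading."""
    lam = np.asarray(loadings, dtype=float)
    return float(np.mean(lam ** 2))


def h_index(loadings):
    """Maximal reliability of an optimally weighted composite."""
    lam = np.asarray(loadings, dtype=float)
    ratio = (lam ** 2 / (1.0 - lam ** 2)).sum()
    return ratio / (1.0 + ratio)


# Invented example: four items, general loadings of 0.7, two group factors.
g = [0.7, 0.7, 0.7, 0.7]
s = [[0.3, 0.0], [0.3, 0.0], [0.0, 0.3], [0.0, 0.3]]
omega_h = omega_hierarchical(g, s)
```

With equal loadings the H-index coincides with omega total; the two diverge only when items differ in how strongly they load on the factor.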
Additional measures of dimensionality were applied in the bifactor model. Explained Common Variance (ECV) is an indicator of unidimensionality, with high ECV indicating a strong general factor compared to group factors [52]. Item Explained Common Variance (IECV) shows item-level variation attributed to a general factor [53]. ECV was used in conjunction with Percent of Uncontaminated Correlations (PUC). ECV > 0.70 and PUC > 0.70 suggest that a given construct is unidimensional [47]. Average Relative Parameter Bias (ARPB) quantifies the bias in item loadings that arises when multidimensionality is ignored and a unidimensional model is specified [47]. An ARPB below 10-15% is considered acceptable [54].
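The bifactor dimensionality indices can be sketched as follows. The formulas are the standard ones for orthogonal bifactor loadings; the function names and the loading values are hypothetical. The only numbers taken from this paper are the subscale sizes (11, 9, and 5 items), from which PUC follows directly.

```python
import numpy as np


def ecv(general, specific):
    """Explained common variance: general-factor share of common variance."""
    g2 = np.asarray(general, dtype=float) ** 2
    s2 = np.asarray(specific, dtype=float) ** 2
    return g2.sum() / (g2.sum() + s2.sum())


def iecv(general, specific):
    """Item-level ECV: one value per item."""
    g2 = np.asarray(general, dtype=float) ** 2
    s2 = (np.asarray(specific, dtype=float) ** 2).sum(axis=1)
    return g2 / (g2 + s2)


def puc(group_sizes):
    """Percent of uncontaminated correlations.

    Item pairs from different group factors are 'uncontaminated': their
    correlation is due to the general factor only.
    """
    n = sum(group_sizes)
    total_pairs = n * (n - 1) // 2
    within_pairs = sum(k * (k - 1) // 2 for k in group_sizes)
    return (total_pairs - within_pairs) / total_pairs


# THI subscale sizes from the Measures section: Functional 11, Emotional 9,
# Catastrophic 5. Pairs: 300 in total, 101 within subscales.
thi_puc = puc([11, 9, 5])

# Invented loadings for a four-item illustration of ECV and IECV.
g = [0.7, 0.7, 0.7, 0.7]
s = [[0.3, 0.0], [0.3, 0.0], [0.0, 0.3], [0.0, 0.3]]
example_ecv = ecv(g, s)
```

Because PUC depends only on how many items each group factor has, any bifactor model that keeps the original 11/9/5 grouping of THI items yields 199/300, which is below the 0.70 guideline.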
The second step involved exploratory non-parametric Mokken scaling to check the monotonicity of items. Selection of the best items for unidimensional parametric IRT modeling was carried out via an automated item selection procedure using a genetic algorithm. In terms of the IRT approach, the scalability of the THI scale was measured using Loevinger's H [55]. If the item-pair scalability coefficients Hij > 0, the item coefficients Hi > 0.3, and the scale coefficient H > 0.3, this suggests a reliable, cumulative scale.
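As an illustration of the scalability coefficients, the sketch below computes Hij, Hi, and H from a person-by-item score matrix using the covariance form of Loevinger's H: each observed covariance is divided by the largest covariance attainable given the two items' marginal distributions. This is a simplified stand-in for the mokken package used in the study. The function names are hypothetical, and the code assumes complete data and items with non-zero variance.

```python
import numpy as np


def max_cov(x, y):
    """Largest covariance attainable given the two marginal distributions.

    The maximum is reached when both variables are paired in sorted order.
    """
    return np.cov(np.sort(x), np.sort(y))[0, 1]


def scalability(data):
    """Return (H_ij matrix, H_i vector, H) for a persons x items matrix."""
    x = np.asarray(data, dtype=float)
    k = x.shape[1]
    cov = np.cov(x, rowvar=False)
    cmax = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            cmax[i, j] = cmax[j, i] = max_cov(x[:, i], x[:, j])
    off = ~np.eye(k, dtype=bool)              # mask of item pairs with i != j
    h_ij = np.full((k, k), np.nan)            # diagonal is left undefined
    h_ij[off] = cov[off] / cmax[off]
    cov_off = np.where(off, cov, 0.0)
    h_i = cov_off.sum(axis=1) / cmax.sum(axis=1)
    h = cov_off.sum() / cmax.sum()
    return h_ij, h_i, h


# Toy data scored 0/2/4 like THI items; rows form a perfect cumulative pattern.
guttman = np.array([[0, 0, 0], [2, 0, 0], [2, 2, 0], [4, 2, 2], [4, 4, 4]])
h_ij, h_i, h = scalability(guttman)
```

A perfectly cumulative response pattern gives H = 1, whereas unrelated items give values near 0, which is why the 0.3 cut-off is read as a lower bound for a usable scale.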
In the third step, three IRT polytomous models were used to assess unidimensional THI scale quality: the Rasch Model for polytomous items, the Generalized Partial Credit Model (GPCM, an extension of the Rasch model) with parameters for item discrimination and adjacent-category response functions [56], and the Graded Response Model (for ordered polytomous categories of a Likert scale and with cumulative category response functions) [57]. The overall fit was checked using the M2 statistic [58]. Marginal reliability was computed, given an estimated model and a prior density function; marginal reliability above 0.7 suggests an acceptable scale. The local independence assumption was checked using Yen's Q3 statistic based on correlation of the residuals for a pair of items [59]. The final scale was developed on the basis of model-based reliability, item goodness of fit, and item information functions.
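For readers unfamiliar with the GPCM, the following sketch shows how category probabilities arise from one discrimination parameter and a set of threshold parameters. It uses the standard adjacent-category form of the model without a scaling constant. The function name and the parameter values are hypothetical and are not estimates from this study.

```python
import numpy as np


def gpcm_probs(theta, a, thresholds):
    """Category probabilities for one GPCM item.

    theta:      latent trait value
    a:          item discrimination
    thresholds: m thresholds for an item with m + 1 ordered categories
    """
    b = np.asarray(thresholds, dtype=float)
    # Cumulative sums of a * (theta - b_v); the lowest category has sum 0.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - b))))
    z -= z.max()                       # numerical stability before exponentiating
    p = np.exp(z)
    return p / p.sum()


# A THI-like item has three categories (no / sometimes / yes), so two thresholds.
p_mid = gpcm_probs(theta=0.0, a=1.0, thresholds=[-1.0, 1.0])
p_at_threshold = gpcm_probs(theta=-1.0, a=1.0, thresholds=[-1.0, 1.0])
```

At a trait level equal to a threshold, the two adjacent categories are equally likely, which is what "adjacent-category response functions" refers to. Fixing `a` to the same value for every item gives the polytomous Rasch (partial credit) case.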
The sample size was calculated using power 0.80 and alpha level 0.05, assuming 3 latent variables, 25 observed variables, and an anticipated effect size of 0.1. The required minimum sample was 823 individuals. Statistical analyses were performed with IBM SPSS Statistics v.24, Mplus 8.2, and the mirt, ltm, eRm, and mokken packages for R.

Basic statistics
Descriptive statistics for the THI items and its subscales are summarized in Table 1. The majority of correlations between individual items and the total score were above 0.5, making the whole scale seem reliable.
Afterwards, four CFA models for the whole THI were tested and they are set out in Figs 1-4.
Results of the dimensionality analysis and a comparison of the models' goodness of fit are shown in Table 2.
All the CTT models had acceptable goodness of fit, taking into account the values of fit indices. However, the bifactor model had a significantly better fit in comparison with the correlated factor model, which was slightly superior to the unidimensional model. In the family of IRT models, bifactor GPCM and unidimensional GPCM had the best fit (M2 statistic); however the SRMR of bifactor GPCM appeared to be too high. In summary, both CTT and IRT confirmatory models suggest a more detailed elaboration of the unidimensional and bifactor models is needed in order to verify the unidimensionality of THI.

Model reliability
Reliability was evaluated for the two best models: unidimensional and bifactor. Results are gathered together in Table 3.
Note to Table 1: Capital letters represent items contained on the subscales: F-Functional, E-Emotional, C-Catastrophic. Corrected item-total correlation is a correlation between the item and the scale score that excludes this item. Items excluded in subsequent analysis are in bold.
The unidimensional model had acceptable reliability. The bifactor model showed high overall and sub-dimension reliability; however, a unidimensional solution was most strongly supported. The value ωH = 0.945 showed that the total score predominantly reflects a single general factor. Omegas for the subscale scores seemed to demonstrate high reliability for the THI sub-factors, but low values of ωHS indicated that almost all subscale score variance is due to the general factor and almost no variance is due to specific factors. This also indicated heavy confounding of subscale reliability (the reliabilities of the subscales were overwhelmingly inflated). PRV values likewise confirmed that the three subdimensions of the THI scale are questionable and suggested that the scale is unidimensional. General ECV values also suggested the scale is unidimensional, with ECVs for the subscales being meaningless. The average relative parameter bias (ARPB) between the unidimensional model and the general factor of the bifactor model was acceptable. Only PUC = 0.66 showed that there might be some multidimensionality in THI; however, it was not severe enough to disqualify the interpretation of the instrument as being primarily unidimensional. The item explained common variance (IECV) indicated that almost all items represent the unidimensional THI scale well, except items THI2 and THI19, for which IECV was below 0.50. The best items for the unidimensional THI scale, i.e. those with the highest IECV, were THI21, THI11, THI16, THI7, THI5, THI6, THI17, THI20, THI23, THI22, and THI24. In general, all criteria of dimensionality analysis (ωH, ωHS, PRV, ECV, PUC, and ARPB) gave sufficient support for scale unidimensionality. In the subsequent analysis, unidimensional IRT-based models are adopted to assess the monotonicity and quality of each THI item.
On the basis of existing subscales, model fit, Hi, and IECV values we propose a shortened unidimensional THI scale that consists of only the "best" items. The selection is based on linear ordering (Hellwig method) and the geometric average of Hi and IECV scores. Items THI2, THI8, THI13, THI19, and THI24 were thus removed from the original scale, and the 20 remaining items were selected for unidimensional parametric polytomous IRT models.
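One ingredient of this selection, combining Hi and IECV through their geometric average, can be sketched as below. This is only a simplified illustration: it does not implement the Hellwig linear ordering, the function name is hypothetical, and the four items and their values are invented rather than taken from the tables of this paper.

```python
import math


def rank_by_geometric_mean(h_i, iecv):
    """Rank items from best to worst by sqrt(H_i * IECV)."""
    scores = {item: math.sqrt(h_i[item] * iecv[item]) for item in h_i}
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores


# Invented values for four made-up items (not THI estimates).
h_i = {"A": 0.50, "B": 0.32, "C": 0.45, "D": 0.40}
iecv = {"A": 0.98, "B": 0.50, "C": 0.80, "D": 0.81}
order, scores = rank_by_geometric_mean(h_i, iecv)
```

The geometric average penalizes an item that is weak on either criterion more strongly than an arithmetic average would, so an item needs both good scalability and a high general-factor share to rank well.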

Item quality of IRT-based models
IRT analysis results for the three IRT models are summarized in Table 4.
The test information curves of compared models are given in Fig 6. The Rasch model was rejected and the GPCM and GRM models seemed to be the most appropriate. The GPCM model was chosen for further analysis.
The reliability of all the models was above the threshold of 0.7 for latent trait levels between -2.5 and +2.5 standard deviations from the mean of the standardized latent trait. Under the GPCM, 93.05% of respondents fitted the model, and it was selected for a more detailed analysis of items and of individual persons' reliability.
Yen's Q3 statistic was used to test the assumption of local independence. The mean value was -0.025 and Q3 ranged between -0.107 and 0.160. The mean Q3 value was less than the threshold value of 0.1 and indicated that the local independence assumption was valid. Additionally, correlations between standardized residuals were calculated and they are gathered in Table 5. The mean value for residual correlations was -0.007 and they ranged from -0.5 to 0.14; for only one pair of items was the value rather high (-0.5).
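The logic of the Q3 check can be sketched as follows: residuals are the differences between observed item scores and the scores expected under the fitted model, and Q3 is the correlation of those residuals for each item pair. The sketch assumes the expected scores have already been obtained from a fitted IRT model; the function names and the toy numbers are hypothetical.

```python
import numpy as np


def q3_matrix(observed, expected):
    """Correlations between item residuals (persons in rows, items in columns)."""
    resid = np.asarray(observed, dtype=float) - np.asarray(expected, dtype=float)
    return np.corrcoef(resid, rowvar=False)


def q3_summary(q3):
    """Mean, minimum, maximum of the off-diagonal Q3 values, and max minus mean."""
    vals = q3[np.triu_indices_from(q3, k=1)]
    mean, low, high = float(vals.mean()), float(vals.min()), float(vals.max())
    return mean, low, high, high - mean


# Toy residual pattern: item 2 moves with item 1, item 3 moves against both.
observed = np.array([[0, 1, 3], [1, 3, 2], [2, 5, 1], [3, 7, 0]])
expected = np.zeros_like(observed)
q3 = q3_matrix(observed, expected)
mean_q3, min_q3, max_q3, max_minus_mean = q3_summary(q3)
```

The last returned value expresses the largest Q3 relative to the average residual correlation, which is a way of reading local dependence that does not rely on a single fixed cut-off.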
The parameters of the GPCM model are given in Table 6. Item locations (difficulties) were calculated as an average of threshold parameters for item response categories (for three item categories, two thresholds exist).

Discussion
Despite widespread use of THI, there are still doubts about its psychometric quality. The first doubt has to do with its unclear factor structure, which means it is not certain whether THI correctly gauges aspects of tinnitus severity. Originally, it was postulated that THI measures three domains of tinnitus severity: functional, emotional, and catastrophic. They were intended to be distinct, although strongly correlated [26].
Our findings do not support these assumptions. They show that, for the clinical population, the original three-factor structure is not the best measure of tinnitus severity. Omega hierarchical subscale indices showed that the proportion of the total variance accounted for by the three subscales was, after controlling for the influence of general tinnitus severity, very small. Other indices (AVE, ECV, PUC, PRV, ARPB) showed that the common variance can be regarded as unidimensional, thus supporting one general factor and a unidimensional solution. These results are in line with our previous research [38] and they are also consistent with those obtained by others [32,33,36,37]. This contrasts with the earlier German study of 373 tinnitus patients [35], which confirmed the three-factor structure of THI. However, it should be noted that the German study compared only a general factor model and a first-order three-factor model. It did not consider a second-order three-factor model or a bifactor model. It is known that a bifactor model is useful for evaluating the validity of multi-item questionnaires which measure both the overall construct and its specific dimensions [47]. In our case, however, the results of bifactor modelling clearly pointed to a one-factor solution. Our results demonstrate that THI should be considered a unidimensional scale, and that the Functional, Emotional, and Catastrophic subscales do not represent separate substantive latent traits. Instead, we believe these subscales share a large portion of the overall general negative affectivity associated with tinnitus.
THI is generally considered to be a reliable tool. The claim about high reliability of THI subscales and overall score, demonstrated by several validating studies, is founded on the use of Cronbach's alpha coefficient. But it is worth emphasizing that such reliability depends on the particular study population, while IRT offers in its place the test information function, which shows the degree of precision at different values of the latent trait. Fig 7 clearly shows that the standard error of measurement (SEM) is smallest in the middle of the scale and increases with higher and lower scores. So, the precision of measurement is highest for subjects with moderate tinnitus severity. Because Cronbach's alpha is embedded in CTT, it is assumed that SEM is constant along the scale, and this is, as we can see, an unfounded assumption. Other drawbacks of this index can be found elsewhere [41][42][43]. Our findings demonstrate that THI is in fact reliable as a unidimensional scale (with no subscales) in our large sample of tinnitus sufferers, and its precision of measurement is highest for subjects with moderate complaints. Mokken analysis confirmed the unidimensionality of THI and allows us to treat it as a reliable cumulative scale. On the basis of several combined criteria, we propose that five items (THI2, THI8, THI13, THI19, THI24) should be removed in order to refine the scale. Three of these excess items belong to the original Functional subscale, while two belong to the Catastrophic subscale. Of the remaining 20 items, the majority cover the emotional aspect of tinnitus. This allows the whole scale to be more consistent, but it does narrow the range of tinnitus which THI measures. Kennedy and colleagues [60] noted that THI, compared to other tinnitus-related questionnaires, contains a disproportionately large number of items related to psychological/emotional aspects of tinnitus.
The results of our study also suggest that tinnitus severity as measured by THI captures mainly the emotional aspects of tinnitus. This may be either a disadvantage or an advantage, depending on whether THI is used in a clinical or research setting and the underlying goal.
We must admit that applying IRT models to the THI posed some difficulties. The model fit statistic (M2) was significant for all tested models. This needs some comment [61], as do the significant χ2 test values in the preceding analyses. First, CTT and IRT models represent an accept-support approach to model testing, in which many "near perfect" models tend to be falsely "rejected". Second, the χ2 statistic is sensitive to sample size; therefore RMSEA, incremental fit indices, and inspection of residuals and residual correlations were developed and are used to support model fit. Third, IRT models are predominantly psychometric rather than purely statistical/econometric models, and therefore focus on the quality of the data (given an IRT model) rather than on the quality of the model itself and its improvement through far-reaching respecification. Additionally, the problem of local independence should be addressed. We used Yen's Q3 statistic; however, as shown by Christensen et al. [62], a single critical value for Q3 is not fully appropriate, and local dependence should rather be considered relative to the average observed residual correlation.
A great advantage and practical application of IRT is in-depth analysis of individual items, which may be used in selecting items during development or refinement of a questionnaire. Item location (level of difficulty) reflects where along the scale the item functions best. Items with a low item location (e.g. THI6, complaining a great deal about tinnitus) are the 'easiest' items, indicating endorsement at mild tinnitus severity, while items with a high item location (e.g. THI17, bad social relationships) are the 'hardest' and target a higher level of tinnitus severity. Item information and discrimination were highest for THI21 (depression), THI14 (irritation), THI12 (difficulty enjoying life), and THI23 (can no longer cope with tinnitus), while they were lowest for THI11 (having a terrible disease) and THI7 (trouble with sleep). IRT parameters indicate which items should be selected to optimize measurement precision and achieve the desired goal of the tool. Items providing more information at lower trait levels are suitable for gauging mild tinnitus severity, while items targeting higher trait levels should be selected to optimize measurement of high tinnitus severity, e.g. in monitoring change over time following treatment. The item information function of THI displayed in Fig 7 clearly shows that THI in its present form is good at assessing individuals in the range θ = -1 to 1, i.e. those with a moderate level of tinnitus severity.
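The link between item parameters, information, and the standard error of measurement can be sketched for the GPCM as follows. Under this model the information of an item at a given trait level equals the squared discrimination times the variance of the category score; test information is the sum over items, and SEM is its inverse square root. The function names and the three items below are hypothetical and do not reproduce the Table 6 estimates.

```python
import numpy as np


def gpcm_probs(theta, a, thresholds):
    """Category probabilities for one GPCM item (adjacent-category form)."""
    b = np.asarray(thresholds, dtype=float)
    z = np.concatenate(([0.0], np.cumsum(a * (theta - b))))
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()


def item_information(theta, a, thresholds):
    """GPCM item information: a^2 times the variance of the category score."""
    p = gpcm_probs(theta, a, thresholds)
    k = np.arange(p.size)                  # category scores 0, 1, ..., m
    mean = (k * p).sum()
    return a ** 2 * ((k - mean) ** 2 * p).sum()


def scale_sem(theta, items):
    """Standard error of measurement at theta for a list of (a, thresholds)."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / np.sqrt(info)


# Invented three-category items whose thresholds lie near the middle of the trait.
items = [(1.5, [-0.5, 0.5]), (1.0, [-1.0, 0.2]), (2.0, [0.0, 0.8])]
sem_middle = scale_sem(0.0, items)
sem_high = scale_sem(4.0, items)
```

When most thresholds sit near the middle of the trait range, as in this toy example, SEM is smallest there and grows toward the extremes. Adding items with higher or lower locations would flatten the curve at the corresponding end.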
Our findings have important clinical and research implications. The unidimensional factor structure of THI allows clinicians to use the tool without unnecessary additional calculations for subscales, thus saving time. Clinicians or researchers should rely only on the global score, because validity of the three subscales (Functional, Emotional, Catastrophic) is questionable, as they appear to provide little information beyond the general factor (overall tinnitus severity). We conclude that the quality of THI in its current form (25 items) is not satisfactory. Newman and colleagues proposed a short version of THI consisting of only 10 items [31], but they were selected on the basis of just three criteria: a high item-total correlation, representativeness of the three content domains, and face validity. We find such criteria insufficient and propose refining the THI instrument by removing just those items with some degree of misfit. We think that short form questionnaires are essential in busy clinical practice and with extensive research protocols, and we recommend taking into account both the CTT and IRT approaches in constructing a short form of THI.
The strength of our THI study is the large sample of tinnitus patients-the largest assembled so far. Patients came from all over Poland to our tinnitus clinic, so the sample can be considered representative of individuals seeking help for tinnitus. However, it is true that a more heterogeneous sample (e.g. in terms of geographic origin) would reduce the potential selection bias that our data might have.
We admit that not all aspects of IRT analysis have been exhausted in this study. Differential Item Functioning (DIF) analysis was omitted due to constraints on the length of this paper. Therefore, we are still unable to say whether between-group differences found with THI (e.g. a difference in tinnitus severity between women and men) should be interpreted as true differences or as measurement artifacts. Further research is needed to establish measurement invariance in various demographic settings and cross-cultural comparisons.
To conclude, the growth of patient-centered care requires high-quality data from Patient Reported Outcome Measures. Application of IRT enables more precise assessment of the measurement properties of THI, so that clinicians and researchers can have more confidence in their diagnoses and in the results of trials based on THI.
We hope our findings might encourage researchers to use the IRT approach to explore the psychometric properties of other tinnitus-related questionnaires. Done well, we expect it will improve the quality of measures based on patients' perception of their ailment.