
Improved measurement of tinnitus severity: Study of the dimensionality and reliability of the Tinnitus Handicap Inventory

  • Elżbieta Gos,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft

    Affiliation Department of Teleaudiology and Screening, World Hearing Center, Institute of Physiology and Pathology of Hearing, Warsaw, Poland

  • Adam Sagan,

    Roles Conceptualization, Formal analysis, Methodology, Software, Writing – original draft

    Affiliation Department of Market Analysis and Marketing Research, Faculty of Management, Cracow University of Economics, Cracow, Poland

  • Piotr H. Skarzynski,

    Roles Conceptualization, Funding acquisition, Resources, Writing – original draft, Writing – review & editing

    Affiliations Department of Teleaudiology and Screening, World Hearing Center, Institute of Physiology and Pathology of Hearing, Warsaw, Poland, Heart Failure and Cardiac Rehabilitation Department, Faculty of Medicine, Medical University of Warsaw, Warsaw, Poland, Institute of Sensory Organs, Kajetany, Poland

  • Henryk Skarzynski

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation Department of Otorhinolaryngosurgery, World Hearing Center, Institute of Physiology and Pathology of Hearing, Warsaw, Poland



Abstract

Background

The Tinnitus Handicap Inventory (THI) is widely used in clinical practice and research as a three-dimensional measure of tinnitus severity. Despite extensive use, its factor structure remains unclear. Furthermore, its reliability has so far been established only with Cronbach’s alpha coefficient, an index grounded in Classical Test Theory. The more modern and robust Item Response Theory (IRT) has so far not been used to psychometrically evaluate THI. In theory, IRT allows a more precise evaluation of THI’s factor structure, reliability, and the quality of individual items.


Methods

There were 1115 patients with tinnitus (556 women and 559 men), aged 19–84 years (M = 51.55; SD = 13.28).

The dimensionality of THI was evaluated using several models of Confirmatory Factor Analysis and an Item Response Theory approach. Exploratory non-parametric Mokken scaling was applied to determine a unidimensional and robust scale. Several IRT polytomous models were used to assess the overall quality of THI.


Results

The bifactor model had the best fit (RMSEA = 0.055; CFI = 0.976; SRMR = 0.040) and revealed one strong general factor and several weak specific factors. Mokken scaling generated a reliable unidimensional scale (Loevinger’s H = 0.463). In order to refine THI, we propose that five items be removed. The IRT Generalized Partial Credit Model generated good parameters in terms of item location (difficulty), discrimination, and information content of items.


Conclusion

Our findings support the use of THI as a reliable unidimensional scale for evaluating tinnitus severity. However, clinicians and researchers should rely only on its overall score, which reflects global tinnitus severity. To improve its psychometric quality, several refinements of THI are proposed.


Introduction

In recent years there has been increasing interest among clinicians and healthcare providers in assessing patients’ health status using Patient Reported Outcome Measures (PROMs). A PROM instrument is any report of the status of a patient’s health that originates directly from the patient [1]. PROMs have been defined as health questionnaires which evaluate aspects of a patient’s health from the patient’s perspective [2].

PROMs are useful in clinical practice for diagnosis, choice of treatment, and monitoring changes. There is evidence that the use of PROMs improves patients’ satisfaction, allows monitoring of response to treatment, and detects unrecognized problems [3]. In clinical trials they serve as primary or secondary endpoints [4], and they are used in health systems and in health policymaking for assessing and improving quality of care [5, 6]. The scope of PROMs’ application is still expanding [7], and efforts have been made recently to ensure that the methodology of PROM use is clinically meaningful, valid, and reliable [8–11]; only then can they serve as effective instruments in enhancing healthcare quality.

PROMs are particularly useful in assessing subjective disorders which are not apparent to others but which are registered only through the complaints of the sufferers–and tinnitus is one such disorder. Tinnitus is the subjective perception of sound without any external acoustic stimulation, and is perceived as ringing in the ears, hissing, chirping, buzzing, or other sounds [12–14]. Its prevalence ranges from 4% to 15% in adults [15], and 6% to 34% in children [16–18]. Tinnitus is accompanied by a broad range of negative emotional symptoms, and significantly impacts on quality of life [19, 20]. Because of the limited effectiveness of audiological assessment and psychoacoustic measurement, self-reported rating scales and questionnaires are widely used in evaluating the severity of individually perceived tinnitus [21–23], where severity is defined as the level of distress or impact that tinnitus has on the person [24]. There is no way of measuring tinnitus severity other than with self-report measures (primarily multi-item questionnaires), which therefore need to have acceptable psychometric quality.

There are many questionnaires used for assessing tinnitus severity [23]. The Tinnitus Handicap Inventory (THI) stands out among them–it is the most commonly used tool which has been validated in the largest number of languages [25]. THI was created to evaluate the impact of tinnitus on daily living [26], and is used as a screening tool for psychiatric disorders [27], and as an outcome measure for evaluating treatment effects in clinical trials [21, 28–30]. There is a brief, time-efficient screening version which consists of only 10 items, and this has greatly increased the use of THI [31].

Although THI is a widespread tool, its factor structure remains unclear. Newman and colleagues originally postulated three factors–the Emotional, Functional, and Catastrophic subscales–but they were based on item content, not on factor analysis [26]. Factor analysis for THI was first reported for a Danish version of THI, but the study sample comprised only 50 tinnitus patients [32]. Exploratory factor analysis did not confirm a three-factor solution, indicating that only the THI total score should be used as a valid measure of tinnitus severity (not the scores on the three subscales). In 2003, Baguley and Andersson conducted exploratory factor analysis of THI in a group of 196 patients, and the analysis gave strong support for a unifactorial structure [33]. To date, more than a dozen factor structure validation studies of THI have been published, with study groups ranging from 50 [32, 34] to 373 patients [35]. The majority of these studies failed to demonstrate a three-factor structure [36–38], although two of them did support the original three-factor solution [35, 39]. In particular, the German study seems very strong: in its confirmatory factor analysis it used a large sample of 373 tinnitus patients [35]. The findings showed that a three-factor model gave a better fit than a unidimensional model, and indicated that the three subscales of THI (Functional, Emotional, Catastrophic) were each valid and provided three distinct dimensions of tinnitus severity.

It is worth noting that work so far has used only a Classical Test Theory (CTT) approach, whereas a more modern and robust approach is now available–Item Response Theory (IRT). In this context, the factor structure of THI is not just an academic exercise but an important problem in clinical practice. It is crucial for a clinician or researcher to know which factor structure (three- or unidimensional) is appropriate to the situation and be confident they can rely on each subscale score or only on the total THI score.

The second issue which is critical to psychometric quality is reliability. The most popular index of reliability is Cronbach’s alpha coefficient, which is based on CTT [40]. All studies concerning the psychometric properties of THI report alpha for both the total scale and the subscales. Reliability across studies appears to be very high, mostly above 0.90. Across almost all studies, alpha for the Functional and Emotional subscales ranged from 0.8 to 0.9, while for the Catastrophic subscale it was lower, about 0.6–0.7. However, Cronbach’s alpha coefficient has numerous limitations [41–43], and other more robust model-based indices of reliability have recently been proposed. Reliability estimates within CTT have some limitations–they are dependent on the particular sample, and measurement error is assumed to be the same across all levels of the ability. IRT overcomes these limitations by treating reliability as precision of measurement independent of the particular sample and by enabling estimation of measurement error at any given level of a latent trait.

The present study has three goals:

  1. To examine the theoretical structure of tinnitus severity as measured by THI. Our starting hypothesis is that a unidimensional model best accounts for the structure of a measured construct.
  2. To determine the reliability of THI in a model-based approach which has so far not been used in psychometric studies of THI.
  3. To give guidance for a potential refinement of THI using Item Response Theory.



Materials and methods

Our retrospective study used data from patients admitted to a tertiary referral ENT center in Poland over the period July 2015 to September 2018. Patients had reported problems with tinnitus as a primary complaint or secondary to hearing loss, and filling in THI was part of the standard diagnostic evaluation. Records of patients were retrospectively screened to check compliance with the eligibility criteria: age above 18 years, duration of tinnitus at least 1 month, documented hearing thresholds based on clinical pure tone audiometry, and a completed Tinnitus Handicap Inventory. The Institutional Review Board approved the protocol of the study (approval no. KB IFPS 18/2018). Due to the retrospective nature of our evaluation, no written consent from the participants was gathered.


Measure

The Tinnitus Handicap Inventory (THI) comprises 25 items grouped into three subscales: Functional, Emotional, and Catastrophic. The Functional subscale (11 items) deals with limitations caused by tinnitus in the areas of mental, social, and physical functioning. The Emotional subscale (9 items) concerns affective responses to tinnitus, e.g. anger, frustration, depression, anxiety. The Catastrophic subscale (5 items) probes the most severe reactions to tinnitus, such as loss of control, inability to escape from tinnitus, and fear of having a terrible disease. For each item a patient can respond with a “yes” (scored 4 points), “sometimes” (2 points), or “no” (0 points). The responses are summed within each subscale and for the total scale. The higher the score, the greater the perceived tinnitus severity [26]. The Polish version of THI validated by Skarzynski et al. [38] was used in this study.
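The scoring scheme above can be sketched in a few lines. This is a minimal illustration only: the item-to-subscale mapping shown is the commonly reported assignment (11 Functional, 9 Emotional, 5 Catastrophic items) and should be checked against the published inventory.

```python
# Point values for the three THI response options, as described above.
SCORE = {"yes": 4, "sometimes": 2, "no": 0}

# Commonly reported item-to-subscale assignment (verify against the
# published THI before clinical use).
SUBSCALES = {
    "Functional":   [1, 2, 4, 7, 9, 12, 13, 15, 18, 20, 24],
    "Emotional":    [3, 6, 10, 14, 16, 17, 21, 22, 25],
    "Catastrophic": [5, 8, 11, 19, 23],
}

def score_thi(responses):
    """responses: dict mapping item number (1-25) to 'yes'/'sometimes'/'no'.
    Returns subscale scores and the total score (0-100)."""
    points = {item: SCORE[ans] for item, ans in responses.items()}
    result = {name: sum(points[i] for i in items)
              for name, items in SUBSCALES.items()}
    result["Total"] = sum(points.values())
    return result
```

A respondent answering “sometimes” to every item, for example, scores 50 of the possible 100 points in total.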


Participants

There were 1115 individuals (556 women and 559 men); their mean age was 51.6 years (SD = 13.3) and ranged from 19 to 84 years. The period of suffering from tinnitus varied from 1 month to 50 years (M = 6.6; SD = 7.7). Most frequently, the tinnitus was bilateral (57%), while 26% of the patients reported tinnitus in the left ear and 17% in the right.

Data analysis

The first step was to evaluate the dimensionality of THI, and here four CFA models were used: a unidimensional CFA, a second-order CFA, a bifactor CFA, and a three-dimensional CFA model with correlated factors. Weighted Least Square estimation with means and variance adjustment of Chi-square statistics (WLSMV) and Theta and Delta parameterization were applied. Taking into account that the THI items are ordinal categorical variables, polychoric correlation coefficients were used. The overall fit of a CFA model was considered adequate if its Root Mean Square Error of Approximation (RMSEA) was < 0.05, the Comparative Fit Index (CFI) was > 0.95, and the Standardised Root Mean Square Residual (SRMR) < 0.05 [44].
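These adequacy cutoffs can be encoded as a simple check (illustrative only; the thresholds are the ones adopted in this study from [44]):

```python
def check_fit(rmsea, cfi, srmr):
    """Return which of the study's CFA adequacy criteria a model meets:
    RMSEA < 0.05, CFI > 0.95, SRMR < 0.05."""
    return {
        "RMSEA": rmsea < 0.05,
        "CFI": cfi > 0.95,
        "SRMR": srmr < 0.05,
    }
```

For instance, the bifactor fit reported later (RMSEA = 0.055; CFI = 0.976; SRMR = 0.040) narrowly misses the RMSEA cutoff while satisfying CFI and SRMR.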

Model-based reliability was assessed by McDonald’s omega, the H-index, and the average variance extracted [45]. McDonald’s omega was calculated as both omega total (ω) and omega hierarchical (ωH), and for the bifactor model omega hierarchical of the subscales (ωHS) and Percentage of Reliable Variance (PRV) were also calculated [46]. An omega value above 0.80 was considered high [47]. Omega hierarchical above 0.75, in conjunction with a PRV above 75%, indicates a scale’s unidimensionality. Omega hierarchical subscale reflects the reliability of a subscale after controlling for the variance due to the general factor [48]. Average Variance Extracted (AVE) refers to the variance captured by a construct relative to the variance due to measurement error; Fornell and Larcker stated it should be at least 0.5 [49]. The H-index is a measure of maximal reliability for an optimally-weighted scale, i.e. when each item contributes different information to the global score [50, 51]. The H-value was expected to have a minimum of 0.7.
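For a unidimensional model with standardized loadings, these three indices reduce to closed-form expressions. The formulas below are the ones commonly given in the psychometric literature (e.g. [45, 49, 50]); this is a sketch, not the authors’ exact computation:

```python
def omega_total(loadings):
    """McDonald's omega total for standardized loadings:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses)."""
    s = sum(loadings) ** 2
    e = sum(1 - l * l for l in loadings)  # uniqueness = 1 - loading^2
    return s / (s + e)

def ave(loadings):
    """Average Variance Extracted: mean squared standardized loading."""
    return sum(l * l for l in loadings) / len(loadings)

def h_index(loadings):
    """Maximal reliability H for an optimally weighted composite."""
    q = sum(l * l / (1 - l * l) for l in loadings)
    return q / (1 + q)
```

With four items all loading 0.7, for example, omega and H both come out near 0.79 and AVE is 0.49, just below the 0.5 benchmark.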

Additional measures of dimensionality were applied in the bifactor model. Explained Common Variance (ECV) is an indicator of unidimensionality, with a high ECV indicating a strong general factor compared to the group factors [52]. Item Explained Common Variance (IECV) shows the item-level variation attributable to the general factor [53]. ECV was used in conjunction with the Percent of Uncontaminated Correlations (PUC). ECV > 0.70 and PUC > 0.70 suggest that a given construct is unidimensional [47]. Average Relative Parameter Bias (ARPB) quantifies the parameter bias which occurs when multidimensionality is ignored and a unidimensional model is specified [47]. An ARPB of less than 10–15% is considered acceptable [54].
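ECV and PUC follow directly from the bifactor loadings and the grouping of items. As a sanity check on the PUC formula, THI’s 11/9/5 item split reproduces the PUC = 0.66 reported in the Results; the ECV loadings below are hypothetical:

```python
from math import comb

def ecv(general, specifics):
    """Explained Common Variance of the general factor.
    general: general-factor loadings; specifics: one list of group-factor
    loadings per specific factor."""
    g = sum(l * l for l in general)
    s = sum(l * l for grp in specifics for l in grp)
    return g / (g + s)

def puc(group_sizes):
    """Percent of Uncontaminated Correlations: the share of item pairs
    whose correlation is not influenced by a shared group factor."""
    n = sum(group_sizes)
    total = comb(n, 2)
    within = sum(comb(k, 2) for k in group_sizes)
    return (total - within) / total
```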

The second step involved exploratory non-parametric Mokken scaling to check the monotonicity of items. Selection of the best items for unidimensional parametric IRT modeling was carried out via an automated item selection procedure using a genetic algorithm. Within this approach, the scalability of the THI scale was measured using Loevinger’s H [55]. If the scalability coefficients satisfy Hij > 0, Hi > 0.3, and H > 0.3, this suggests a reliable, cumulative scale.
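Loevinger’s H compares the observed number of Guttman errors with the number expected under item independence. The sketch below covers the dichotomous case only (THI items are polytomous, for which the mokken R package generalizes the formula), with tiny made-up data:

```python
def loevinger_h(data):
    """Loevinger's H for dichotomous items:
    H = 1 - (observed Guttman errors) / (expected under independence).
    data: list of respondent rows of 0/1 item scores."""
    n = len(data)
    k = len(data[0])
    p = [sum(row[j] for row in data) / n for j in range(k)]  # item popularity
    obs = exp = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            # order the pair as (easier, harder) by popularity
            i, j = (a, b) if p[a] >= p[b] else (b, a)
            # Guttman error: fails the easier item but passes the harder one
            obs += sum(1 for row in data if row[i] == 0 and row[j] == 1)
            exp += n * (1 - p[i]) * p[j]
    return 1 - obs / exp
```

A perfect Guttman pattern yields H = 1; random responding pushes H toward 0.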

In the third step, three IRT polytomous models were used to assess unidimensional THI scale quality: the Rasch Model for polytomous items, the Generalized Partial Credit Model (GPCM, an extension of the Rasch model) with parameters for item discrimination and adjacent-category response functions [56], and the Graded Response Model (for ordered polytomous categories of a Likert scale and with cumulative category response functions) [57]. The overall fit was checked using the M2 statistic [58]. Marginal reliability was computed, given an estimated model and a prior density function; marginal reliability above 0.7 suggests an acceptable scale. The local independence assumption was checked using Yen’s Q3 statistic based on correlation of the residuals for a pair of items [59]. The final scale was developed on the basis of model-based reliability, item goodness of fit, and item information functions.
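The GPCM’s adjacent-category response functions can be sketched directly from its defining equation. Parameter values here are hypothetical; for a three-category THI item there are two step parameters:

```python
import math

def gpcm_probs(theta, a, thresholds):
    """Category probabilities under the Generalized Partial Credit Model.
    a: item discrimination; thresholds: step parameters [b1, ..., bm],
    giving m+1 response categories (THI items have 3: 0/2/4 points).
    P(X=k) is proportional to exp(sum over v<=k of a*(theta - b_v))."""
    cum = [0.0]  # empty sum for the lowest category
    for b in thresholds:
        cum.append(cum[-1] + a * (theta - b))
    z = [math.exp(c) for c in cum]
    total = sum(z)
    return [v / total for v in z]
```

With symmetric steps (e.g. b = [-1, 1]) a respondent at theta = 0 is most likely to pick the middle (“sometimes”) category.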

The sample size was calculated using power 0.80 and alpha level 0.05, assuming 3 latent variables, 25 observed variables, and an anticipated effect size of 0.1. The required minimum sample was 823 individuals. Statistical analyses were performed with IBM SPSS Statistics v.24, Mplus 8.2, and the mirt, ltm, eRm, and mokken libraries of the R package.


Results

Basic statistics

Descriptive statistics for the THI items and its subscales are summarized in Table 1. The majority of correlations between individual items and the total score were above 0.5, suggesting that the whole scale is internally consistent.

Dimensionality of CTT- and IRT-based measurement models

Before testing multidimensional models, unidimensional CFA analyses of the Functional, Emotional, and Catastrophic subscales were conducted using the WLSMV method.

For the Functional subscale: χ2 (44) = 295.14; p < 0.001; RMSEA = 0.072; CFI = 0.978; SRMR = 0.042. After allowing correlated errors between items THI7 and THI20 and between THI7 and THI2 (based on modification indices): χ2 (42) = 247.06; p < 0.001; RMSEA = 0.066; CFI = 0.982; SRMR = 0.038.

For the Emotional subscale: χ2 (27) = 234.24; p < 0.001; RMSEA = 0.083; CFI = 0.983; SRMR = 0.035. After allowing correlated errors between items THI3 and THI14, THI25 and THI17, and THI25 and THI22: χ2 (24) = 111.02; p < 0.001; RMSEA = 0.057; CFI = 0.993; SRMR = 0.023.

For the Catastrophic subscale: χ2 (5) = 82.81; p < 0.001; RMSEA = 0.118; CFI = 0.967; SRMR = 0.045. After allowing correlated errors between items THI8 and THI19, the fit improved drastically: χ2 (4) = 5.83; p < 0.001; RMSEA = 0.020; CFI = 0.999; SRMR = 0.012.

Afterwards, four CFA models for the whole THI were tested; they are set out in Figs 1–4.

Results of dimensionality analysis and comparison of models of goodness of fit are shown in Table 2.

All the CTT models had acceptable goodness of fit, taking into account the values of the fit indices. However, the bifactor model had a significantly better fit than the correlated-factor model, which was in turn slightly superior to the unidimensional model. In the family of IRT models, the bifactor GPCM and the unidimensional GPCM had the best fit (M2 statistic); however, the SRMR of the bifactor GPCM appeared too high. In summary, both the CTT and IRT confirmatory models suggest that a more detailed elaboration of the unidimensional and bifactor models is needed in order to verify the unidimensionality of THI.

Model reliability

Reliability was evaluated for the two best models: unidimensional and bifactor. Results are gathered together in Table 3.

The unidimensional model had acceptable reliability. The bifactor model showed high overall and sub-dimension reliability; however, a unidimensional solution was most strongly supported. ωH = 0.945 showed that the total score predominantly reflects a single general factor. Omegas for the subscale scores seemed to demonstrate high reliability of the THI sub-factors, but the low values of ωHS indicated that almost all sub-scale score variance is due to the general factor and almost none to the specific factors. This also points to heavy confounding of sub-scale reliability (the reliabilities of the sub-scales were greatly inflated). The PRV values likewise indicated that the three subdimensions of the THI scale are questionable and suggest that the scale is unidimensional. The general ECV values also suggested the scale is unidimensional, with the ECVs for the sub-scales being negligible. The ARPB between the unidimensional scale and the general factor in the bifactor model was acceptable. Only PUC = 0.66 showed that there might be some multidimensionality in THI; however, it was not severe enough to disqualify the interpretation of the instrument as primarily unidimensional. The Item Explained Common Variance (IECV) indicated that almost all items represent the unidimensional THI scale well, except items THI2 and THI19, whose values were less than 0.50. The best items for a unidimensional THI scale, having the highest IECV, were THI21, THI11, THI16, THI7, THI5, THI6, THI17, THI20, THI23, THI22, and THI24.

In general, all criteria of dimensionality analysis (ωH, ωHS, PRV, ECV, PUC, and ARPB) gave sufficient support for scale unidimensionality. In the subsequent analysis, unidimensional IRT-based models are adopted to assess the monotonicity and quality of each THI item.

Exploratory Mokken model of the unidimensional THI scale

Having verified the unidimensionality and cumulative character of the THI scale, an exploratory nonparametric Mokken model was used to evaluate the scale’s monotonicity and to select items. All the scalability coefficients Hij between pairs of items were positive (Hij > 0) and ranged from 0.127 (THI2–THI7) to 0.733 (THI5–THI10). THI2 and THI24 were regarded as the weakest items (Hi < 0.3). Loevinger’s H for the total scale was 0.463 (SE = 0.011). Additional reliability measures (MS and LCRC) showed a reliable unidimensional scale: MS = 0.909, LCRC = 0.949. The Automated Item Selection Procedure (AISP) for the Mokken scale using a genetic algorithm also confirmed unidimensionality (excluding items THI2 and THI24). The relationships between the Hi and IECV measures are plotted in Fig 5.

On the basis of existing sub-scales, model fit, Hi, and IECV values we propose a shortened unidimensional THI scale that consists of only the “best” items. The selection is based on linear ordering (Hellwig method) and the geometric average of Hi and IECV scores. Items THI2, THI8, THI13, THI19, and THI24 were thus removed from the original scale, and the 20 remaining items were selected for unidimensional parametric polytomous IRT models.
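The combined criterion can be sketched as a simple ranking by the geometric average of Hi and IECV (the item values below are hypothetical, and the full Hellwig linear-ordering procedure is more involved than this sketch):

```python
import math

def rank_items(scores):
    """Rank items by the geometric mean of (Hi, IECV); higher is better.
    scores: dict mapping item name -> (Hi, IECV) tuple."""
    return sorted(
        scores,
        key=lambda item: math.sqrt(scores[item][0] * scores[item][1]),
        reverse=True,
    )
```

Items falling at the bottom of such a ranking are the natural candidates for removal.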

Item quality of IRT-based models

IRT analysis results for the three IRT models are summarized in Table 4.

The test information curves of compared models are given in Fig 6.

The Rasch model was rejected and the GPCM and GRM models seemed to be the most appropriate. The GPCM model was chosen for further analysis.

The reliability of all the models was above the 0.7 threshold within the range of –2.5 to +2.5 standard deviations around the mean of the standardized latent trait. The GPCM fitted 93.05% of respondents, and it was selected for a more detailed analysis of items and of individual persons’ reliability.

Yen’s Q3 statistic was used to test the assumption of local independence. The mean value was –0.025, with Q3 ranging from –0.107 to 0.160. The mean Q3 value was less than the threshold value of 0.1, indicating that the local independence assumption was valid. Additionally, correlations between standardized residuals were calculated; they are gathered in Table 5. The mean residual correlation was –0.007, values ranged from –0.5 to 0.14, and only one pair of items showed a rather high value (–0.5).
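Yen’s Q3 for a pair of items is simply the Pearson correlation between their IRT residuals (each person’s observed item score minus the model-expected score at that person’s estimated θ). A sketch with hypothetical residual vectors:

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def q3(resid_i, resid_j):
    """Yen's Q3: correlation between two items' residuals across persons.
    Values near 0 support local independence."""
    return pearson(resid_i, resid_j)
```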

The parameters of the GPCM model are given in Table 6. Item locations (difficulties) were calculated as an average of threshold parameters for item response categories (for three item categories, two thresholds exist).

Item difficulties ranged between –0.656 (THI6) and 0.798 (THI17), item discrimination between 0.703 (THI11) and 2.440 (THI21), and item information between 1.40 (THI11) and 4.88 (THI21). Between –2 and +2 standardized values of Θ (the latent trait continuum), where the THI scale has the highest precision, item information values ranged from 0.940 (THI11) to 4.74 (THI21), as shown in Fig 7.
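Under the GPCM, an item’s Fisher information at θ equals a² times the variance of the response category at that θ, and the standard error of measurement is 1/√(test information). A self-contained sketch with hypothetical item parameters:

```python
import math

def gpcm_item_info(theta, a, thresholds):
    """GPCM item information at theta: a^2 * Var(category | theta),
    using the adjacent-category formulation with step parameters b_v."""
    cum = [0.0]
    for b in thresholds:
        cum.append(cum[-1] + a * (theta - b))
    z = [math.exp(c) for c in cum]
    total = sum(z)
    p = [v / total for v in z]                      # category probabilities
    ek = sum(k * pk for k, pk in enumerate(p))       # E[K | theta]
    ek2 = sum(k * k * pk for k, pk in enumerate(p))  # E[K^2 | theta]
    return a * a * (ek2 - ek * ek)

def sem(theta, items):
    """Standard error of measurement from total test information.
    items: list of (a, thresholds) pairs."""
    info = sum(gpcm_item_info(theta, a, th) for a, th in items)
    return 1.0 / math.sqrt(info)
```

As in Fig 7, information peaks near the item’s thresholds, so the SEM grows toward the extremes of the trait continuum.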


Discussion

Despite the widespread use of THI, there are still doubts about its psychometric quality. The first doubt has to do with its unclear factor structure, which means it is not certain whether THI correctly gauges aspects of tinnitus severity. Originally, it was postulated that THI measures three domains of tinnitus severity: functional, emotional, and catastrophic. They were intended to be distinct, although strongly correlated [26].

Our findings do not support these assumptions: they show that, for the clinical population, the original three-factor structure is not the best measure of tinnitus severity. Omega hierarchical sub-scale indices showed that the proportion of the total variance accounted for by the three subscales, after controlling for the influence of general tinnitus severity, was very small. Other indices (AVE, ECV, PUC, PRV, ARPB) showed that the common variance can be regarded as unidimensional, thus supporting one general factor and a unidimensional solution. These results are in line with our previous research [38] and are also consistent with those obtained by others [32, 33, 36, 37]. This contrasts with the earlier German study of 373 tinnitus patients [35], which confirmed the three-factor structure of THI.

However, it should be noted that the German study compared only a general factor model and a first-order three-factor model; it did not consider a second-order three-factor model or a bifactor model. It is known that a bifactor model is useful for evaluating the validity of multi-item questionnaires which measure both an overall construct and its specific dimensions [47]. In our case, however, the results of bifactor modelling clearly demonstrated a one-factor solution. Our results show that THI should be considered a unidimensional scale, and that the Functional, Emotional, and Catastrophic subscales do not represent separate substantive latent traits. Instead, we believe these subscales share a large portion of the overall general negative affectivity associated with tinnitus.

THI is generally considered to be a reliable tool. The claim of high reliability of the THI subscales and overall score, demonstrated by several validation studies, is founded on the use of Cronbach’s alpha coefficient. But it is worth emphasizing that this form of reliability depends on the particular study population, whereas IRT offers in its place the test information function, which shows the degree of precision at different values of the latent trait. Fig 7 clearly shows that the standard error of measurement (SEM) is smallest in the middle of the scale and increases at higher and lower scores. The precision of measurement is therefore highest for subjects with moderate tinnitus severity. Cronbach’s alpha, being embedded in CTT, assumes that the SEM is constant along the scale, and this is, as we can see, an unfounded assumption. Other drawbacks of this index can be found elsewhere [41–43]. Our findings demonstrate that THI is in fact reliable as a unidimensional scale (with no subscales) in our large sample of tinnitus sufferers, and that its precision of measurement is highest for subjects with moderate complaints.

Mokken analysis confirmed the unidimensionality of THI and allows us to treat it as a reliable cumulative scale. On the basis of several combined criteria, we propose that five items (THI2, THI8, THI13, THI19, THI24) should be removed in order to refine the scale. Three of these excess items belong to the original Functional subscale, while two belong to the Catastrophic subscale. Of the remaining 20 items, the majority cover the emotional aspect of tinnitus. This makes the whole scale more consistent, but it does narrow the range of tinnitus experience which THI measures. Kennedy and colleagues [60] noted that THI, compared to other tinnitus-related questionnaires, contains a disproportionately large number of items related to psychological/emotional aspects of tinnitus. The results of our study also suggest that tinnitus severity as measured by THI captures mainly the emotional aspects of tinnitus. This may be either a disadvantage or an advantage, depending on whether THI is used in a clinical or research setting and on the underlying goal.

We must admit that applying IRT models to THI posed some difficulties. The model fit statistic (M2) was significant for all tested models, which requires some comment [61], as do the significant χ2 values in the previous analyses. First, CTT and IRT models represent an accept-support approach to model testing, in which many “near perfect” models tend to be falsely rejected. Second, the χ2 statistic is generally sensitive to sample size; for this reason RMSEA, incremental fit indices, and inspection of residuals and residual correlations were used to support model fit. Third, IRT models are predominantly psychometric rather than purely statistical/econometric models, and so the focus is on the quality of the data (given the IRT model) rather than on the quality of the model itself and on improving it through far-reaching respecification. Additionally, the problem of local independence should be addressed. We used Yen’s Q3 statistic; however, as shown by Christensen et al. [62], a single critical value for Q3 is not fully appropriate, and local dependence should instead be considered relative to the average observed residual correlation.

A great advantage and practical application of IRT is the in-depth analysis of individual items, which may be used to select items during development or refinement of a questionnaire. Item location (level of difficulty) reflects where along the scale an item functions best. Items with a low item location (e.g. THI6, complaining a great deal about tinnitus) are the ‘easiest’ items, indicating endorsement at mild tinnitus severity, while items with a high item location (e.g. THI17, poor social relationships) are the ‘hardest’ and target a higher level of tinnitus severity. Information and discrimination were highest for THI21 (depression), THI14 (irritation), THI12 (difficulty enjoying life), and THI23 (can no longer cope with tinnitus), and lowest for THI11 (having a terrible disease) and THI7 (trouble with sleep). IRT parameters indicate which items should be selected to optimize measurement precision and achieve the desired goal of the tool. Items providing more information at lower trait levels are suitable for gauging mild tinnitus severity, while items targeting higher trait levels should be selected to optimize measurement of high tinnitus severity, e.g. when monitoring change over time following treatment. The item information functions of THI displayed in Fig 7 clearly show that THI in its present form is good at assessing individuals in the range Θ = –1 to 1, i.e. those with a moderate level of tinnitus severity.

Our findings have important clinical and research implications. The unidimensional factor structure of THI allows clinicians to use the tool without unnecessary additional calculations for subscales, thus saving time. Clinicians or researchers should rely only on the global score, because validity of the three subscales (Functional, Emotional, Catastrophic) is questionable, as they appear to provide little information beyond the general factor (overall tinnitus severity). We conclude that the quality of THI in its current form (25 items) is not satisfactory. Newman and colleagues proposed a short version of THI consisting of only 10 items [31], but they were selected on the basis of just three criteria: a high item–total correlation, representativeness of the three content domains, and face validity. We find such criteria insufficient and propose refining the THI instrument by removing just those items with some degree of misfit. We think that short form questionnaires are essential in busy clinical practice and with extensive research protocols, and we recommend taking into account both the CTT and IRT approaches in constructing a short form of THI.

The strength of our THI study is the large sample of tinnitus patients–the largest assembled so far. Patients came from all over Poland to our tinnitus clinic, so the sample can be considered representative of individuals seeking help for tinnitus. However, it is true that a more heterogeneous sample (e.g. in terms of geographic origin) would reduce the potential selection bias that our data might have.

We admit that not all aspects of IRT analysis have been exhausted in this study. Differential Item Functioning (DIF) analysis was omitted due to constraints on the length of this paper. Therefore, we are still unable to say whether between-group differences shown with THI (e.g. a difference in tinnitus severity between women and men) reflect true differences or measurement artifacts. Further research is needed to establish measurement invariance across various demographic settings and in cross-cultural comparisons.

To conclude, the growth of patient-centered care requires high-quality data from Patient Reported Outcome Measures. Applying IRT enables more precise assessment of the THI's measurement properties, so that clinicians and researchers can have greater confidence in diagnoses and trial results based on the THI.

We hope our findings will encourage researchers to use the IRT approach to explore the psychometric properties of other tinnitus-related questionnaires. Done well, this should improve the quality of measures based on patients' perception of their condition.


We acknowledge the practice staff and the patients who participated in the study. We would also like to thank Dr Andrew Bell for proofreading the manuscript.


  1. Food and Drug Administration. Guidance for Industry. FDA; 2009.
  2. Cappelleri J, Zou K, Bushmakin A, Alvir J, Alemayehu D, Symonds T. Patient-Reported Outcomes: Measurement, Implementation and Interpretation. Chapman and Hall/CRC; 2014.
  3. Chen J, Ou L, Hollis SJ. A systematic review of the impact of routine collection of patient reported outcome measures on patients, providers and health organisations in an oncologic setting. BMC Health Serv Res. 2013;13:211. pmid:23758898
  4. Mercieca-Bebber R, King MT, Calvert MJ, Stockler MR, Friedlander M. The importance of patient-reported outcomes in clinical trials and strategies for future optimization. Patient Relat Outcome Meas. 2018;9:353–67. pmid:30464666
  5. Garratt AM, Bjaertnes ØA, Krogstad U, Gulbrandsen P. The OutPatient Experiences Questionnaire (OPEQ): data quality, reliability, and validity in patients attending 52 Norwegian hospitals. Qual Saf Health Care. 2005;14(6):433–7. pmid:16326790
  6. Nelson EC, Eftimovska E, Lind C, Hager A, Wasson JH, Lindblad S. Patient reported outcome measures in practice. BMJ. 2015;350:g7818. pmid:25670183
  7. Snyder CF, Jensen RE, Segal JB, Wu AW. Patient-reported outcomes (PROs): putting the patient perspective in patient-centered outcomes research. Med Care. 2013;51(8 Suppl 3):S73–79.
  8. Aaronson N, Alonso J, Burnam A, Lohr KN, Patrick DL, Perrin E, et al. Assessing health status and quality-of-life instruments: attributes and review criteria. Qual Life Res. 2002;11(3):193–205. pmid:12074258
  9. Terwee CB, Bot SDM, de Boer MR, van der Windt DAWM, Knol DL, Dekker J, et al. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol. 2007;60(1):34–42. pmid:17161752
  10. Valderas JM, Ferrer M, Mendívil J, Garin O, Rajmil L, Herdman M, et al. Development of EMPRO: a tool for the standardized assessment of patient-reported outcome measures. Value Health. 2008;11(4):700–8. pmid:18194398
  11. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539–49. pmid:20169472
  12. Møller AR. Introduction. In: Møller AR, Langguth B, DeRidder D, Kleinjung T, editors. Textbook of Tinnitus. New York: Springer-Verlag; 2011. pp. 3–7.
  13. Tunkel DE, Bauer CA, Sun GH, Rosenfeld RM, Chandrasekhar SS, Cunningham ER, et al. Clinical practice guideline: tinnitus. Otolaryngol Head Neck Surg. 2014;151(2 Suppl):S1–40.
  14. Jastreboff PJ. 25 years of tinnitus retraining therapy. HNO. 2015;63(4):307–11. pmid:25862626
  15. Møller AR. Epidemiology of tinnitus in adults. In: Møller AR, Langguth B, DeRidder D, Kleinjung T, editors. Textbook of Tinnitus. New York: Springer-Verlag; 2011. pp. 29–37.
  16. Savastano M. Characteristics of tinnitus in childhood. Eur J Pediatr. 2007;166(8):797–801. pmid:17109163
  17. Skarzynski PH, Kochanek K, Skarzynski H, Senderski A, Szkielkowska A, Bartnik G, et al. Hearing screening program in school-age children in Western Poland. Int Adv Otol. 2011;7(2):194–200.
  18. Piotrowska A, Raj-Koziak D, Lorens A, Skarżyński H. Tinnitus reported by children aged 7 and 12 years. Int J Pediatr Otorhinolaryngol. 2015;79(8):1346–50.
  19. Langguth B. A review of tinnitus symptoms beyond "ringing in the ears": a call to action. Curr Med Res Opin. 2011;27(8):1635–43. pmid:21699365
  20. Zeman F, Koller M, Langguth B, Landgrebe M. Which tinnitus-related aspects are relevant for quality of life and depression: results from a large international multicentre sample. Health Qual Life Outcomes. 2014;12:7. pmid:24422941
  21. Langguth B, Goodey R, Azevedo A, Bjorne A, Cacace A, Crocetti A, et al. Consensus for tinnitus patient assessment and treatment outcome measurement: Tinnitus Research Initiative meeting, Regensburg, July 2006. Prog Brain Res. 2007;166:525–36. pmid:17956816
  22. Meikle MB, Stewart BJ, Griest SE, Henry JA. Tinnitus outcomes assessment. Trends Amplif. 2008;12(3):223–35. pmid:18599500
  23. Fackrell K, Hall D, Barry J, Hoare D. Tools for tinnitus measurement: development and validity of questionnaires to assess handicap and treatment effects. In: Signorelli F, Turjman F, editors. Tinnitus: causes, treatment and short and long-term health effects. New York: Nova Biomedical; 2014. pp. 13–60.
  24. Cima RFF, Mazurek B, Haider H, Kikidis D, Lapira A, Noreña AJ, et al. A multidisciplinary European guideline for tinnitus: diagnostics, assessment, and treatment. HNO. 2019;67:10–42. pmid:30847513
  25. Skarzynski PH, Rajchel JJ, Gos E, Dziendziel B, Kutyba J, Swierniak W, et al. A revised grading system for the Tinnitus Handicap Inventory based on a large clinical population. Int J Audiol. 2020;59(1):61–67. pmid:31608728
  26. Newman CW, Jacobson GP, Spitzer JB. Development of the Tinnitus Handicap Inventory. Arch Otolaryngol Head Neck Surg. 1996;122(2):143–8. pmid:8630207
  27. Salviati M, Macrì F, Terlizzi S, Melcore C, Provenzano A, Capparelli E, et al. The Tinnitus Handicap Inventory as a screening test for psychiatric comorbidity in patients with tinnitus. Psychosomatics. 2013;54(3):248–56. pmid:23219227
  28. McCombe A, Baguley D, Coles R, McKenna L, McKinney C, Windle-Taylor P, et al. Guidelines for the grading of tinnitus severity: the results of a working group commissioned by the British Association of Otolaryngologists, Head and Neck Surgeons, 1999. Clin Otolaryngol Allied Sci. 2001;26(5):388–93. pmid:11678946
  29. Gudex C, Skellgaard PH, West T, Sørensen J. Effectiveness of a tinnitus management programme: a 2-year follow-up study. BMC Ear Nose Throat Disord. 2009;9:6. pmid:19558680
  30. Zeman F, Koller M, Figueiredo R, Azevedo A, Rates M, Coelho C, et al. Tinnitus Handicap Inventory for evaluating treatment effects: which changes are clinically relevant? Otolaryngol Head Neck Surg. 2011;145(2):282–7. pmid:21493265
  31. Newman CW, Sandridge SA, Bolek L. Development and psychometric adequacy of the screening version of the Tinnitus Handicap Inventory. Otol Neurotol. 2008;29(3):276–81. pmid:18277308
  32. Zachariae R, Mirz F, Johansen LV, Andersen SE, Bjerring P, Pedersen CB. Reliability and validity of a Danish adaptation of the Tinnitus Handicap Inventory. Scand Audiol. 2000;29(1):37–43. pmid:10718675
  33. Baguley DM, Andersson G. Factor analysis of the Tinnitus Handicap Inventory. Am J Audiol. 2003;12(1):31–4. pmid:12894865
  34. Oron Y, Sergeeva NV, Kazlak M, Barbalat I, Spevak S, Lopatin AS, et al. A Russian adaptation of the Tinnitus Handicap Inventory. Int J Audiol. 2015;54(7):485–9. pmid:25620408
  35. Kleinstäuber M, Frank I, Weise C. A confirmatory factor analytic validation of the Tinnitus Handicap Inventory. J Psychosom Res. 2015;78(3):277–84. pmid:25582803
  36. Meng Z, Zheng Y, Liu S, Wang K, Kong X, Tao Y, et al. Reliability and validity of the Chinese (Mandarin) Tinnitus Handicap Inventory. Clin Exp Otorhinolaryngol. 2012;5(1):10–6. pmid:22468196
  37. Bolduc D, Désilets F, Tardif M, Leroux T. Validation of a French (Québec) version of the Tinnitus Handicap Inventory. Int J Audiol. 2014;53(12):903–9. pmid:25140601
  38. Skarzynski PH, Raj-Koziak D, Rajchel J, Pilka A, Wlodarczyk AW, Skarzynski H. Adaptation of the Tinnitus Handicap Inventory into Polish and its testing on a clinical population of tinnitus sufferers. Int J Audiol. 2017;56(10):711–5. pmid:28537137
  39. Aqeel M, Ahmed A. Translation, adaptation and cross-language validation of the Tinnitus Handicap Inventory in Urdu. J Audiol Otol. 2017;22(1):13–9. pmid:29325390
  40. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16(3):297–334.
  41. Schmitt N. Uses and abuses of coefficient alpha. Psychol Assess. 1996;8(4):350–3.
  42. Sijtsma K. On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika. 2009;74(1):107–20. pmid:20037639
  43. Dunn TJ, Baguley T, Brunsden V. From alpha to omega: a practical solution to the pervasive problem of internal consistency estimation. Br J Psychol. 2014;105(3):399–412. pmid:24844115
  44. Hooper D, Coughlan J, Mullen M. Structural equation modelling: guidelines for determining model fit. Electronic Journal of Business Research Methods. 2008;6(1):53–60.
  45. Sagan A. Analiza rzetelności skal w wielopoziomowych modelach pomiaru (Reliability analysis in multilevel measurement models). In: Prace Naukowe Uniwersytetu Ekonomicznego we Wrocławiu (Research Papers of Wrocław University of Economics). Wrocław: Uniwersytet Ekonomiczny we Wrocławiu; 2014. pp. 49–59.
  46. Rodriguez A, Reise SP, Haviland MG. Evaluating bifactor models: calculating and interpreting statistical indices. Psychol Methods. 2016;21(2):137–50. pmid:26523435
  47. Rodriguez A, Reise SP, Haviland MG. Applying bifactor statistical indices in the evaluation of psychological measures. J Pers Assess. 2016;98(3):223–37. pmid:26514921
  48. Reise SP, Bonifay WE, Haviland MG. Scoring and modeling psychological measures in the presence of multidimensionality. J Pers Assess. 2013;95(2):129–40. pmid:23030794
  49. Fornell C, Larcker DF. Evaluating structural equation models with unobservable variables and measurement error. J Market Res. 1981;18(1):39–50.
  50. Bentler P. Covariance structure models for maximal reliability of unit-weighted composites. In: Lee S, editor. Handbook of Latent Variable and Related Models. New York: Elsevier; 2007. pp. 1–19.
  51. Hancock G, Mueller R. Rethinking construct reliability within latent variable systems. In: Cudeck R, du Toit S, Sorbom D, editors. Structural Equation Modeling: Present and Future—A Festschrift in Honor of Karl Joreskog. Lincolnwood, IL: Scientific Software International; 2001. pp. 195–216.
  52. Reise SP. The rediscovery of bifactor measurement models. Multivar Behav Res. 2012;47(5):667–96.
  53. Stucky BD, Thissen D, Edelen MO. Using logistic approximations of marginal trace lines to develop short assessments. Appl Psychol Meas. 2013;37(1):41–57.
  54. Muthén B, Kaplan D, Hollis M. On structural equation modeling with data that are not missing completely at random. Psychometrika. 1987;52(3):431–62.
  55. Mokken RJ. A Theory and Procedure of Scale Analysis, With Applications in Political Research. Reprint 2011. Berlin, Boston: De Gruyter Mouton; 2011.
  56. Muraki E. A generalized partial credit model: application of an EM algorithm. ETS Research Report Series. 1992;(1):i–30.
  57. Samejima F. Estimation of latent ability using a response pattern of graded scores. ETS Research Bulletin Series. 1968;(1):i–169.
  58. Maydeu-Olivares A, Joe H. Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables. J Am Stat Assoc. 2005;100(471):1009–20.
  59. Yen WM. Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Appl Psychol Meas. 1984;8(2):125–45.
  60. Kennedy V, Wilson C, Stephens D. Quality of life and tinnitus. Audiol Med. 2009;2:29–40.
  61. Barrett P. Structural equation modelling: adjudging model fit. Pers Individ Dif. 2007;42(5):815–24.
  62. Christensen KB, Makransky G, Horton M. Critical values for Yen's Q3: identification of local dependence in the Rasch model using residual correlations. Appl Psychol Meas. 2017;41(3):178–94. pmid:29881087