Calibration and Validation of the Dutch-Flemish PROMIS Pain Interference Item Bank in Patients with Chronic Pain

The Dutch-Flemish PROMIS Group translated the adult PROMIS Pain Interference item bank into Dutch-Flemish. The aims of the current study were to calibrate the parameters of these items using an item response theory (IRT) model, to evaluate the cross-cultural validity of the Dutch-Flemish translations compared to the original English items, and to evaluate their reliability and construct validity. The 40 items in the bank were completed by 1085 Dutch chronic pain patients. Before calibrating the items, IRT model assumptions were evaluated using confirmatory factor analysis (CFA). Items were calibrated using the graded response model (GRM), an IRT model appropriate for items with more than two response options. To evaluate cross-cultural validity, differential item functioning (DIF) for language (Dutch vs. English) was examined. Reliability was evaluated based on standard errors and Cronbach’s alpha. To evaluate construct validity, correlations with scores on legacy instruments (e.g., the Disabilities of the Arm, Shoulder and Hand Questionnaire) were calculated. Unidimensionality of the Dutch-Flemish PROMIS Pain Interference item bank was supported by CFA tests of model fit (CFI = 0.986, TLI = 0.986). Furthermore, the data fit the GRM and showed good coverage across the pain interference continuum (threshold parameter range: -3.04 to 3.44). The Dutch-Flemish PROMIS Pain Interference item bank has good cross-cultural validity (only two out of 40 items showing DIF), good reliability (Cronbach’s alpha = 0.98), and good construct validity (Pearson correlations between 0.62 and 0.75). A computerized adaptive test (CAT) and short forms of the Dutch-Flemish PROMIS Pain Interference item bank can now be developed.


Introduction
The prevalence of chronic pain is high in western populations, ranging from 10.1 to 55.2% [1][2][3]. Chronic pain is defined as pain that persists beyond the normal tissue healing time. The most prevalent type is musculoskeletal pain, with prevalence varying from 30-40% for low back pain, 15-20% for shoulder and neck pain, 10-15% for chronic widespread pain and 2% for fibromyalgia [3,4]. Chronic pain often leads to substantial limitations in daily activities [4]. Pain interference refers to the degree to which pain interferes with or limits a person's social, mental and physical activities [5]. Self-reported pain interference has increasingly become an important indicator of the experiences of patients with pain and has recently been recommended as a core outcome in international core sets [6,7]. Consequently, pain interference is an important construct to measure in patients with chronic pain.
The National Institutes of Health (NIH) Patient-Reported Outcomes Measurement Information System (PROMIS) initiative has developed a dynamic assessment system for measuring patient-reported health [8][9][10]. Included in this system is an item bank that targets pain interference. An item bank is a set of items that measure the same construct and whose parameters have been estimated using an IRT model. Both the item parameters and the person parameters are placed on the same underlying metric. Item banks can be used to tailor the assessment to individual trait levels using computerized adaptive testing (CAT) [10]. In an IRT-based CAT, each successive item is chosen based on the answers given to previous items. Because of this tailored administration of items, individuals only respond to a minimal number of relevant items.
To develop the PROMIS Pain Interference item bank, items from existing Patient-Reported Outcome Measures (PROMs) were collected, combined and revised, and new items were developed to ensure the full range of the construct was covered [11]. PROMIS item banks and CATs have been shown to have strong content validity, good responsiveness and other desirable psychometric properties, and have the potential to be implemented worldwide [12][13][14][15]. Furthermore, PROMIS scores are easier to interpret than scores on traditional PROMs, because the PROMIS scores are expressed on a standardized T-score metric.
The Dutch-Flemish PROMIS Group translated 17 adult PROMIS item banks and 9 paediatric PROMIS item banks into Dutch-Flemish (to accommodate the Dutch-speaking part of Belgium in addition to those in The Netherlands), including the PROMIS Pain Interference item bank. Details of this work have been published [16,17].
The first aim of the current study was to calibrate the Dutch-Flemish PROMIS Pain Interference item bank based on responses to items by Dutch patients with chronic pain. The second aim was to evaluate the cross-cultural validity between scores on the Dutch-Flemish and the United States (US) PROMIS Pain Interference item bank. The third study aim was to evaluate the reliability and construct validity of the Dutch-Flemish PROMIS Pain Interference item bank scores.

Methods
The study was approved by the local institutional review board (Medical Ethical Committee Slotervaart hospital and Reade). To be eligible, patients had to provide written informed consent.
Competing Interests: Due to international collaboration, twelve persons co-author this paper. All authors have read and approved the final manuscript. Jaap van Leeuwen is an employee of Leones Group BV. Dennis Revicki is an employee of Outcomes Research. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.
For this study, 2808 patients from the Amsterdam Pain (AMS-PAIN) cohort were invited to participate. The AMS-PAIN cohort consists of chronic pain patients who have been registered since September 2010 at Reade, an outpatient secondary care center for rheumatology and rehabilitation in the Netherlands. To be eligible, patients had to have at least one chronic pain condition for at least three months prior to participating in the study and had to be 21 years or older. For evaluating the cross-cultural validity (or measurement equivalence) of the Dutch-Flemish versus the US PROMIS Pain Interference item bank, data from the US PROMIS Pain Interference American Chronic Pain Association (ACPA) sample were used. The ACPA sample consists of 967 patients with chronic pain who completed the PROMIS Pain Interference item bank [5]. All ACPA chronic pain patients met study eligibility criteria of being 21 years or older and having at least one chronic pain condition for at least three months prior to participating in the US PROMIS Wave 1 study [5].

Procedures
Patients from the AMS-PAIN cohort were invited by e-mail or letter to fill in a web-based (digital) or paper-and-pencil (paper) questionnaire that included, among other measures, the full Dutch-Flemish PROMIS Pain Interference item bank. For the digital questionnaire, patients received personal login codes. Patients who were unable to complete the digital questionnaire were asked to complete the paper version.

Measures
The questionnaire included the full Dutch-Flemish PROMIS Pain Interference item bank. The translation of the US PROMIS Pain Interference item bank into Dutch-Flemish was performed by Functional Assessment of Chronic Illness Therapy multilingual translation (FACITtrans) using standardized methodology and approved by the PROMIS Statistical Center [16,18]. This translation included multiple forward and back translations, independent reviews and pilot testing with cognitive debriefing among 70 Dutch and Flemish adults [16]. The Dutch-Flemish PROMIS Pain Interference item bank contains 40 items covering a wide range of pain interference content [5]. The time frame is the past 7 days. There are three different 5-point Likert response scales: 1) not at all/a little bit/somewhat/quite a bit/very much; 2) never/rarely/sometimes/often/always; 3) never/once a week or less/once every few days/once a day/every few hours [5]. Demographic information was also collected (i.e. age, gender, country of birth, educational level).
In addition, the questionnaire contained five legacy instruments including the pain intensity item (Global07) from the Dutch-Flemish PROMIS Global Health item bank (an 11-point numeric rating scale (NRS) with 0 = 'no pain' and 10 = 'worst pain imaginable') [19]. Four reliable and valid condition-specific instruments were also included. The "Neck Disability Index" (NDI) consists of 10 items measuring self-reported pain intensity and the influence of neck pain on daily activities, with a total score ranging from 0 to 50 [20,21]. Evidence has accumulated for the reliability and validity of the NDI within Dutch patients with chronic neck pain [21][22][23][24]. The "Disabilities of the Arm, Shoulder and Hand" (DASH) questionnaire was used for patients with chronic shoulder pain. The DASH consists of 30 items measuring disabilities of the upper extremities, with a total score ranging from 0 to 100 [25,26]. DASH scores have demonstrated good reliability and validity in Dutch patients with a variety of disorders of the upper limb [25,[27][28][29]. The "Roland Morris Disability Questionnaire" (RMDQ) consists of 24 items measuring disabilities as a result of chronic back pain, with a total score ranging from 0 to 24 [30,31]. RMDQ scores have demonstrated good reliability and validity within Dutch patients with chronic low back pain [30,[32][33][34]. The fourth condition-specific legacy instrument was the "Fibromyalgia Impact Questionnaire" (FIQ), used for patients with fibromyalgia. The FIQ consists of 20 items measuring physical disabilities as a result of fibromyalgia, with a total score ranging from 0 to 100 [35,36]. FIQ scores have demonstrated moderate to good reliability and validity among Dutch patients with fibromyalgia [36,37]. For each legacy instrument higher scores indicate more intensity, disability or impact.

Statistical analysis
Calibration of the Dutch-Flemish PROMIS Pain Interference item bank. The psychometric analyses were conducted using the PROMIS analysis plan [10]. Similar analyses were done as for the calibration of the Dutch-Flemish PROMIS Pain Behavior item bank [38]. To evaluate the measurement properties of the pain interference items, IRT based analyses were used. IRT models estimate the relationship between an item response category and the level of the measured construct, in this study the level of pain interference. Before calibrating the item parameters of the Dutch-Flemish PROMIS Pain Interference item bank, the three IRT assumptions, unidimensionality, local independence and monotonicity, were evaluated [10].
Unidimensionality was examined using Confirmatory Factor Analysis (CFA) in which all items were hypothesized to load on a single factor. The analysis was performed in R (version 3.0.1) using the package lavaan (version 0.5-16), and model fit was evaluated based on the Comparative Fit Index (CFI), Tucker Lewis Index (TLI) and Root Mean Square Error of Approximation (RMSEA) [39]. We used recommended criteria for unidimensionality (CFI >0.95, TLI >0.95, and RMSEA <0.06) [10]. Furthermore, unidimensionality was considered sufficient when the first factor accounted for at least 20% of the variability and when the ratio of the variance explained by the first to the second factor was greater than 4 [10,40]. This was examined with exploratory factor analysis (EFA).
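The decision rules above can be collected in a small helper. The following Python sketch is purely illustrative and is not part of the original analysis, which was run in R; the function name `check_unidimensionality` and its argument names are hypothetical. The example call uses the fit values reported in the Results section of this paper.

```python
def check_unidimensionality(cfi, tli, rmsea, var_first, var_second):
    """Return a pass/fail flag for each unidimensionality criterion.

    var_first and var_second are the proportions of variance explained
    by the first and second EFA factors (e.g. 0.66 and 0.05).
    """
    return {
        "cfi": cfi > 0.95,
        "tli": tli > 0.95,
        "rmsea": rmsea < 0.06,
        "first_factor_share": var_first >= 0.20,
        "variance_ratio": (var_first / var_second) > 4,
    }


# Fit values for the full 40-item bank as reported in the Results.
flags = check_unidimensionality(
    cfi=0.986, tli=0.986, rmsea=0.159, var_first=0.66, var_second=0.05
)
for criterion, passed in flags.items():
    print(f"{criterion}: {'met' if passed else 'not met'}")
```

Keeping the criteria as separate flags, rather than one combined verdict, mirrors how the paper weighs the indices jointly instead of rejecting the model on a single index.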
Another IRT assumption is local independence, which means that after controlling for the dominant factor, there should be no significant covariance among item responses. Local dependence was evaluated by examining the residual correlation matrix resulting from the single-factor CFA. Residual correlations greater than 0.2 were considered indicators of possible local dependence [10]. In addition, local independence was studied using Yen's Q3 statistic [41]. This statistic calculates the residual item scores under the graded response model (GRM) and correlates these among items. Cohen's rules of thumb were used for correlation effect sizes [42]. Under these rules, Q3 values between 0.24 and 0.36 are moderate deviations, and values of 0.37 and greater represent large deviations. The impact of local dependence on IRT parameter estimates was evaluated by removing the locally dependent items one by one and examining changes in the IRT parameters of the remaining items [10].
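To make the Q3 screening step concrete, the sketch below correlates item residuals across persons and labels each item pair with the cut-offs given above. It is an illustration under stated assumptions, not the authors' code: the function `classify_q3` is hypothetical, the residuals are simulated rather than taken from a fitted GRM, and values between 0.36 and 0.37 are treated as moderate.

```python
from itertools import combinations

import numpy as np


def classify_q3(residuals):
    """Label item pairs by the size of their residual correlation (Q3).

    residuals: array of shape (n_persons, n_items) holding observed minus
    model-expected item scores. Returns {(i, j): "moderate" | "large"}.
    """
    q3 = np.corrcoef(residuals, rowvar=False)
    flagged = {}
    for i, j in combinations(range(q3.shape[0]), 2):
        if q3[i, j] >= 0.37:
            flagged[(i, j)] = "large"
        elif q3[i, j] >= 0.24:
            flagged[(i, j)] = "moderate"
    return flagged


# A 40-item bank yields 40 * 39 / 2 unique item pairs.
n_items = 40
n_pairs = n_items * (n_items - 1) // 2

# Simulated residuals for 4 items; item 1 is built to depend on item 0.
rng = np.random.default_rng(0)
residuals = rng.normal(size=(500, 4))
residuals[:, 1] = residuals[:, 0] + 0.3 * rng.normal(size=500)
flagged = classify_q3(residuals)
print(n_pairs, flagged)
```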
A third IRT assumption is monotonicity, in which the probability of endorsing a higher item response category should increase (or at least not decrease) with increasing levels of the underlying construct. Monotonicity of the Dutch-Flemish PROMIS Pain Interference items was evaluated by fitting a non-parametric IRT model, using Mokken scaling in the R package mokken [43,44]. This model yields non-parametric IRT response curve estimates that show the probabilities of endorsing response categories and can be visually inspected to evaluate monotonicity.
After evaluation of the IRT assumptions, a GRM was fit to the item response data using the R package ltm [45,46]. The GRM has two types of item parameters: the item thresholds and the item slope [10]. Item threshold parameters indicate item difficulty, locate items along the measured trait, and show the coverage across the pain interference continuum. The item slope parameter represents the discriminative ability of the items, with higher slope values indicating better ability to discriminate between adjoining values on the construct.
To assess the fit of the GRM and the degree to which possible misfit affects the IRT model, the S-X² statistic was used [47]. This statistic compares the observed and expected response frequencies under the estimated IRT model and quantifies the differences between them. Items with an S-X² p-value of less than 0.001 were considered to have poor fit [10,47].
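As an illustration of how the threshold and slope parameters jointly determine response probabilities under the GRM, the following Python sketch computes the category probabilities for one 5-category item. The function name and the parameter values are hypothetical and are not taken from Table 2; software packages may use a different but equivalent parameterization.

```python
import numpy as np


def grm_category_probs(theta, slope, thresholds):
    """Category probabilities for one item under the graded response model.

    theta: trait level; slope: item discrimination; thresholds: increasing
    threshold parameters (4 thresholds for a 5-category item).
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # Probability of responding in category k or higher, for each threshold.
    p_star = 1.0 / (1.0 + np.exp(-slope * (theta - thresholds)))
    # Pad with 1 (lowest category or higher) and 0 (above the highest).
    cumulative = np.concatenate(([1.0], p_star, [0.0]))
    # Adjacent differences give the probability of each single category.
    return cumulative[:-1] - cumulative[1:]


# Hypothetical item: slope 2.0, thresholds spread over the trait range.
low = grm_category_probs(theta=-3.0, slope=2.0, thresholds=[-1.0, 0.0, 1.0, 2.0])
high = grm_category_probs(theta=3.0, slope=2.0, thresholds=[-1.0, 0.0, 1.0, 2.0])
print(np.round(low, 3))
print(np.round(high, 3))
```

A person low on the trait is most likely to choose the lowest category and a person high on the trait the highest one; a steeper slope makes this shift happen over a narrower trait range.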
Differential item functioning within the Dutch AMS-PAIN sample. Differential item functioning (DIF) analyses are used to examine whether people from different groups (e.g. age or gender) with the same level of the trait (in this study the same level of pain interference) have different probabilities of giving a certain response to an item [10,48,49]. There are two kinds of DIF: uniform and non-uniform [10,48,49]. Uniform DIF exists when the DIF is consistent, with the same magnitude of DIF across the entire range of the trait. Non-uniform DIF exists when the magnitude or direction of DIF differs across the trait. DIF was evaluated with the R package lordif (version 0.2-2) using ordinal logistic regression models with a McFadden's pseudo R² change of 2% as critical value [10,50,51]. In this portion of the study, we used this method to evaluate DIF within the Dutch AMS-PAIN sample based on age (median split: under 50 years vs. 50 years and over), gender (male vs. female), and administration mode (digital vs. paper).
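For readers unfamiliar with the ordinal logistic regression approach to DIF, it is usually described as a comparison of three nested models per item. The notation below is a sketch of that general framework and may differ in detail from the exact specification used by the software; here u_i is the response to item i, k a response category, θ the trait estimate, and g the group indicator (e.g. gender).

```latex
\begin{aligned}
\text{Model 1:}\quad & \operatorname{logit} P(u_i \ge k) = \alpha_k + \beta_1 \theta \\
\text{Model 2:}\quad & \operatorname{logit} P(u_i \ge k) = \alpha_k + \beta_1 \theta + \beta_2 g \\
\text{Model 3:}\quad & \operatorname{logit} P(u_i \ge k) = \alpha_k + \beta_1 \theta + \beta_2 g + \beta_3 (\theta \times g) \\[4pt]
R^2_{\text{McFadden}} &= 1 - \frac{\ln L_{\text{model}}}{\ln L_{\text{null}}}, \qquad
\Delta R^2_{a,b} = R^2_{b} - R^2_{a}
\end{aligned}
```

Under this framework, uniform DIF corresponds to the change from Model 1 to Model 2, non-uniform DIF to the change from Model 2 to Model 3, and an item is flagged when the relevant pseudo R² change reaches the 2% (0.02) critical value stated above.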
Reliability. Reliability within IRT is conceptualized as "information", which takes into account the fact that measurement precision can differ across levels of the measured trait (θ = Theta). The relationship between information and standard error (SE) is defined by the formula $SE(\theta) = 1/\sqrt{I(\theta)}$, where SE is the standard error of the estimated θ, I is information, and θ is the estimated trait level (ranging from no or mild pain interference to high levels of pain interference) [5,11]. The formula indicates that increased scale information is related to smaller SEs and, therefore, greater measurement precision. Using the calculated SEs, plots were overlaid showing SE (as an indicator of reliability) across the score range of the 4-item short form (v1.0.4a), the 8-item short form (v1.0.8a), the 8-item simulated CAT (always 8 items, no other stopping rules), and the total item bank. The 8-item simulated CAT was conducted with the R package catR (version 3.4) [52]. The IRT theta scores of the Dutch AMS-PAIN sample were transformed into T-scores anchored on the US item parameters cue sheet of the US PROMIS Pain Interference item bank [5]. On this metric, a T-score of 50 represents the average score of the general US population, with a standard deviation of 10. For the total item bank, Cronbach's alpha was calculated.
Cross-cultural validity. Differences in descriptive characteristics between the Dutch AMS-PAIN patients and the US ACPA patients were evaluated with independent samples t-tests and chi-square tests, for continuous and categorical variables respectively.
For the evaluation of cross-cultural validity of the Dutch-Flemish PROMIS Pain Interference item bank versus the US PROMIS Pain Interference item bank, DIF for language (Dutch vs. English) was analysed with the R package lordif (version 0.2-2), using ordinal logistic regression models with a McFadden's pseudo R² change of 2% as critical value [10,50,51]. When items were flagged as potential DIF items, the wABC effect size index was computed [53]. Furthermore, the impact of DIF was examined by plotting item characteristic curves (ICC) (not shown) and test characteristic curves (TCC). The TCC plots showed the scores for all 40 Pain Interference items (ignoring DIF), and the scores for only the items having DIF [51].
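The relation between information, standard error and the T-score metric can be illustrated numerically. The Python sketch below is not the authors' code and rests on two assumptions that are labeled in the comments: that T-scores are the linear transformation T = 50 + 10θ implied by a mean of 50 and a standard deviation of 10, and that reliability can be approximated as 1 - SE² when θ has unit variance in the reference population. All function names are hypothetical.

```python
import math


def se_from_information(information):
    """Standard error of theta: SE(theta) = 1 / sqrt(I(theta))."""
    return 1.0 / math.sqrt(information)


def theta_to_t(theta):
    """Assumed linear T-score transformation (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta


def reliability_from_se(se_theta):
    """Approximate reliability, assuming theta has variance 1."""
    return 1.0 - se_theta ** 2


# Quadrupling the information halves the standard error.
print(se_from_information(4.0), se_from_information(16.0))
# An SE of 0.3 on the theta metric corresponds to 3 T-score points.
print(10.0 * 0.3, reliability_from_se(0.3))
# A theta of 1.41 maps to a T-score close to the sample mean reported later.
print(theta_to_t(1.41))
```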
Construct validity. Construct validity of the Dutch-Flemish PROMIS Pain Interference item bank was evaluated by correlating the T-scores of the Dutch-Flemish PROMIS Pain Interference item bank to the (total) scores on the legacy instruments (Dutch-Flemish PROMIS Global Health pain intensity item score, NDI, DASH, RMDQ, and FIQ). Construct validity was evaluated using Pearson correlations. We hypothesized that the Dutch-Flemish PROMIS Pain Interference item bank scores would have high correlations (r >0.50) with all the legacy instruments.

Participants
Of the 2808 invited patients of the Dutch AMS-PAIN cohort, 1140 responded to the questionnaire (response rate 40.6%). No differences were found between responders and non-responders on age, gender, country of birth, or education level. Among the 1140 respondents, 29 patients were excluded because they did not give informed consent and 26 patients responded to none of the items of the Dutch-Flemish PROMIS Pain Interference item bank, leaving N = 1085 patients. Because the GRM analyses can accommodate incomplete data, all 1085 were used for the IRT calibration. All other analyses were based on responses of the 973 patients with complete data.
The demographic characteristics of the Dutch AMS-PAIN sample and the US ACPA chronic pain sample are summarized in Table 1. Of the AMS-PAIN patients, 78% (n = 846) were female and the average age (SD) was 49 years (13) with a range from 21 to 85. Fifty-seven percent (n = 621) of these were born in the Netherlands, and 82% had at least a high school degree. Of the AMS-PAIN patients, 83% (n = 891) indicated that the duration of their pain was more than 2 years, and the average pain intensity on a NRS (SD) was 6.6 (2). Patients reported having chronic low back pain (71%), chronic neck or shoulder pain (70%), fibromyalgia (35%), chronic widespread pain (47%), migraine or other chronic headache (35%), and osteoarthritis (35%). Twelve percent of the chronic pain patients reported having rheumatoid arthritis and 2% reported cancer. No differences were found in age, gender or pain intensity between the Dutch AMS-PAIN sample and the US ACPA sample. However, there were some differences in educational levels, pain duration and type of chronic pain condition. Slightly more Dutch AMS-PAIN patients reported pain duration of 1-2 years. The US ACPA sample was more educated with 97% reporting high school education or more, while in the Dutch AMS-PAIN sample 82% reported high school education level or more.

Calibration of the Dutch-Flemish PROMIS Pain Interference item bank
The CFA results indicated good fit to a unidimensional model. The CFI was 0.986 and the TLI was 0.986, which are above the criterion of >0.95 [10]. However, the RMSEA was 0.159, which is somewhat larger than the criterion of <0.06. For the 8-item short form (v1.0.8a), the CFI was 0.996, the TLI 0.995 and the RMSEA 0.161. The first factor in EFA accounted for 66% of the variance, and the second factor accounted for 5% of the variance; hence the ratio of the variance explained by the first to the second factor is 13, which favourably exceeds the published criterion of 4 [10]. Based on these results, it was concluded that the Dutch-Flemish PROMIS Pain Interference items share a single common factor, and are sufficiently unidimensional for modelling using the GRM.
Examination of the residual correlation matrix showed a small number of locally dependent item pairs. Twenty-five of the 780 item pairs (3.2%) had residual correlations greater than 0.2. Yen's Q3 statistic indicated at least a moderate deviation for 49 of the 780 item pairs (6.3%). The item pairs with the greatest dependency were PAININ42 ("How often did pain prevent you from standing for more than one hour?")-PAININ47 ("How often did pain prevent you from standing for more than 30 minutes?") with a residual correlation of 0.77, and PAININ50 ("How often did pain prevent you from sitting for more than 30 minutes?")-PAININ55 ("How often did pain prevent you from sitting for more than one hour?") with a residual correlation of 0.70. These items were removed one by one and the impact on the item parameters of the remaining items was evaluated. After removal of item 42 or 47, the mean difference in item thresholds of the remaining items was 0.004, the mean difference in the slope parameters was 0.01, and the residual correlation of the next highest item pair was 0.254. The mean difference in item thresholds after removal of item 50 or 55 was 0.15, the mean difference in slope parameters was 0.12, and the residual correlation of the next highest item pair was 0.265. These results suggest minimal impact of local dependence.
The Mokken scalability coefficient of the full pain interference item bank was 0.63, suggesting strong scalability according to published criteria [43,44,54]. All of the items had a scalability coefficient that was higher than the lower bound of 0.30. Based on these results, it was concluded that the Dutch-Flemish PROMIS Pain Interference items met the assumption of monotonicity.
Table 2 summarizes the IRT item parameters of the Dutch-Flemish PROMIS Pain Interference items. The item threshold parameters ranged from -3.04 to 3.44. The item slope parameters ranged from 0.95 to 3.02. The item with the lowest discrimination parameter was PAININ54 ("How often did pain keep you from getting into a standing position?"), and the item with the highest discrimination parameter was PAININ31 ("How much did pain interfere with your ability to participate in social activities?").
The probability values for the S-X² statistics ranged from 0.0006 to 0.9725. Based on the criterion of an S-X² p-value of less than 0.001, only 2 out of 40 items (PAININ13 and PAININ42) were found to misfit the GRM.

Differential item functioning within the Dutch AMS-PAIN sample
None of the Dutch-Flemish PROMIS Pain Interference items were flagged for DIF for age, gender, or administration mode.

Reliability
As shown in Table 1

Cross-cultural validity
The analysis of DIF for language flagged 2 items with some level of uniform DIF (see Table 2): PAININ24 (R² = 0.052, wABC = 0.597) and PAININ32 (R² = 0.025, wABC = 0.423). For both items (PAININ24: "How often was pain distressing to you?" and PAININ32: "How often did pain make you feel discouraged?") the Dutch patients were more likely to endorse lower response categories compared to the US patients who were at the same level of the trait.
The overall impact of DIF for language on the TCC is shown in Fig 2. The left graph shows the TCC for all 40 Pain Interference items (ignoring DIF), and the right graph shows the TCC for just the 2 items having DIF. These curves show that the Pain Interference total score is only slightly lower for Dutch patients than for US patients, indicating minimal impact of DIF by language. In fact, the right-hand graph shows that accounting for DIF in the two flagged items would change the score on the full bank by less than half a point.

Discussion
The aim of the current study was to calibrate the Dutch-Flemish PROMIS Pain Interference item bank in Dutch patients with chronic pain and to evaluate the cross-cultural validity of the Dutch-Flemish compared to the US PROMIS Pain Interference item bank. The reliability and construct validity of the Dutch-Flemish PROMIS Pain Interference item bank were also evaluated. The results supported the unidimensionality, model fit, and breadth of coverage of the Dutch-Flemish pain interference bank. Furthermore, the analyses showed no evidence for DIF due to age, gender, or administration mode, and scores exhibited good cross-cultural validity, reliability, and construct validity. This study is the first calibration study of the Dutch-Flemish PROMIS Pain Interference item bank.
The Dutch-Flemish PROMIS Group aims to improve the measurement of patient-reported outcomes in the Netherlands and Flanders (the Dutch-speaking part of Belgium) by providing and supporting the implementation of IRT-based, efficient, highly reliable and valid PROMIS item banks and CATs [16]. PROMIS item banks and CATs have better content validity compared to traditional PROMs [13]. PROMIS item banks are based on a well-developed conceptual model with clearly defined unidimensional constructs and have been developed using extensive qualitative research with patients [15]. PROMIS item banks show good measurement properties; they have small measurement errors and show better responsiveness compared to more traditional PROMs [12,14,15]. This makes PROMIS item banks more suitable for daily clinical practice than traditional PROMs. Increased responsiveness reduces the sample sizes needed in clinical studies [12,14]. Through the use of IRT-based methods, PROMIS item bank and CAT scores approximate an interval scale instead of an ordinal scale, and therefore are easier to interpret than scores of more traditional PROMs [55,56]. The PROMIS scores are expressed on a common standardized T-score metric, and because they are calibrated using an IRT model, the T-scores can be estimated even if people do not respond to the same items, for instance when using CAT. The use of CAT has great advantages compared to more traditional paper questionnaires; CATs are tailored to the patient's trait level and are therefore more efficient and precise [55,56].
The analyses of the IRT assumptions show that the required assumptions of unidimensionality and monotonicity are met, but there is some local dependence. In the CFA results, the CFI as well as the TLI supported unidimensionality. The RMSEA was beyond the criterion of <0.06, but RMSEA values tend to be elevated when the number of items is large [57]. Furthermore, Miles and Shevlin (2007) indicate that the RMSEA alone is not meaningful; the fit indices (CFI, TLI and RMSEA) have to be considered as a whole, together with the sample size and the reliability of the measurement, to determine the model fit [58]. Therefore, given the high CFI and TLI, the large sample size and the high reliability of the item bank, the high RMSEA in the current study is of little concern. The results suggest that a certain amount of local dependence is present. This could possibly influence the T-scores computed with a CAT. However, these local dependence results are based only on the analyses of the Dutch AMS-PAIN sample, and before we make decisions on removing items from the Dutch-Flemish PROMIS Pain Interference item bank, we need to reproduce the analyses in a Dutch general population sample. Therefore, it was decided not to remove items from the item bank at this moment. Until the analyses have been reproduced in a Dutch general population sample, a CAT can be configured so that both items of a locally dependent pair are not administered to the same person. The calibration analyses of the Dutch-Flemish PROMIS Pain Interference items show that the range of the item threshold parameters indicates good coverage across the range of the pain interference construct. Furthermore, the item threshold parameters show which items are most useful for measuring different levels of pain interference, which is required for the selection process of items in a CAT.
No items were flagged for DIF with respect to gender, age and administration mode. Therefore, the Dutch-Flemish PROMIS Pain Interference items and scores can be used across patients that differ in gender and age, and differ in the way of completing the item bank (digital or paper).
Although the response rate in this study was only 40.6%, the large sample size of 1085 patients is reassuring. When comparing the Dutch AMS-PAIN sample with the US ACPA sample, no differences were found in age, gender and pain intensity. However, the differences in educational level are noteworthy, where the US ACPA patients were more educated than the Dutch AMS-PAIN patients (97% vs 82% reporting high school education or more).
The evaluation of cross-cultural validity of the Dutch-Flemish PROMIS Pain Interference item bank versus the US PROMIS Pain Interference item bank identified evidence of DIF for language across 2 out of the 40 items. However, DIF had a minimal impact on the item scores. Therefore we conclude that the cross-cultural item differences were negligible and that all items can be retained in the item bank. For both items showing DIF there are some potential translational improvements. Therefore, we recommend testing new (possibly better) translations of these two items in a future data collection.
The plot of the standard errors across the range of the Dutch-Flemish PROMIS Pain Interference T-scores shows that the 8-item short form performs slightly better than the 8-item CAT at very low levels of pain interference (T-score <37). This could possibly be explained by the item selection procedure used in the simulated CAT [59]. To estimate a person's T-score, the CAT starts in the middle of the trait range (T-score = 50) by administering the item with the highest information [59]. Because of this item selection procedure, an 8-item CAT is possibly too short to provide an accurate estimate of low T-scores [59]. However, the impact of this will be minimal because there are almost no respondents with such low levels (T-score <37) of pain interference. Furthermore, the difference will disappear when using the CAT stopping rule of SE <0.3 (the more commonly used method) instead of a fixed number of items. Another explanation could be that the short forms include items covering the whole construct, whereas a CAT does not, because the items chosen in a CAT depend on the person's level of the construct. In this study, most patients were located at the higher end of the pain interference construct, so that items at the lower end had a lower probability of being administered in the simulated CAT.
This study supports the construct validity of the Dutch-Flemish PROMIS Pain Interference item bank: the correlations between the Dutch-Flemish PROMIS Pain Interference item bank and the legacy instruments were high, as expected.
The Dutch-Flemish PROMIS Pain Interference item bank is ready to be used as an item bank or short form. A 4-item PROMIS Pain Interference short form was developed within PROMIS (v1.0.4a), including items with the highest information value. Furthermore, a 6-item (v1.0.6b) and an 8-item (v1.0.8a) PROMIS Pain Interference short form were developed within PROMIS. When selecting Dutch-Flemish PROMIS Pain Interference items for short forms, it would be preferable to select items without DIF. Fortunately, the two items showing DIF for language are not included in the PROMIS Pain Interference short forms.
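The item selection step of the simulated CAT discussed in this section (start at T-score = 50 and administer the most informative item) can be sketched as follows. This Python example is illustrative only: the simulation in this study was run with catR, the three-item bank and its parameters are hypothetical, and maximum-information selection is assumed as the selection rule.

```python
import numpy as np


def grm_item_information(theta, slope, thresholds):
    """Fisher information of one graded response model item at theta."""
    thresholds = np.asarray(thresholds, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-slope * (theta - thresholds)))
    cumulative = np.concatenate(([1.0], p_star, [0.0]))
    probs = cumulative[:-1] - cumulative[1:]
    # Derivative of each cumulative curve (zero at the fixed boundaries).
    d_cumulative = slope * cumulative * (1.0 - cumulative)
    d_probs = d_cumulative[:-1] - d_cumulative[1:]
    # Information is the sum over categories of (P_k')^2 / P_k.
    return float(np.sum(d_probs ** 2 / probs))


def select_next_item(theta, item_bank, administered):
    """Return the not-yet-administered item with maximum information."""
    candidates = {
        name: grm_item_information(theta, slope, thresholds)
        for name, (slope, thresholds) in item_bank.items()
        if name not in administered
    }
    return max(candidates, key=candidates.get)


# Hypothetical bank: name -> (slope, thresholds) on the theta metric.
bank = {
    "item_low": (2.0, [-3.0, -2.5, -2.0, -1.5]),
    "item_mid": (2.0, [-1.0, -0.3, 0.3, 1.0]),
    "item_high": (2.0, [1.5, 2.0, 2.5, 3.0]),
}

start_theta = (50.0 - 50.0) / 10.0  # T-score 50 on the assumed T = 50 + 10 * theta metric
first_item = select_next_item(start_theta, bank, administered=set())
print(first_item)
print(select_next_item(2.2, bank, administered=set()))
```

Because selection follows the current trait estimate, items located far from that estimate are rarely chosen, which is consistent with the explanation above for the weaker performance of a fixed-length CAT at very low T-scores.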
The Dutch-Flemish PROMIS Pain Interference item bank is now calibrated in Dutch persons with chronic pain and ready for use. For the time being, we recommend using the US PROMIS Pain Interference item parameters and the US T-score metric, with T = 50 (the mean T-score of the general US population) as reference point, on which the Dutch chronic pain sample is anchored with a mean T-score of 64.1. We recommend future analyses on data collected with the Dutch-Flemish PROMIS Pain Interference item bank in the general Dutch and Flemish population, and in patient groups with other health problems resulting in (chronic) pain. After data collection in the general Dutch and Flemish population the item bank needs to be recalibrated, and then a Dutch-Flemish T-score metric can be developed with T = 50 (the mean T-score of the general Dutch-Flemish population) as reference point. It should then also be decided whether Dutch-Flemish specific item parameters are needed or whether the US item parameters can also be used in Dutch-Flemish patients. Furthermore, for future research it would be interesting to study DIF for factors other than age, gender, administration mode and language (e.g. medical diagnosis). It would also be interesting to evaluate the impact of DIF on the Dutch-Flemish PROMIS Pain Interference scores obtained by CAT, by comparing a CAT applying the Dutch-Flemish item parameters with a CAT applying the original US item parameters. The impact of DIF may be greater when using CAT as compared to using the total item bank, because a CAT uses only a small item set [10]. Another important step for future research, and also for implementing the Dutch-Flemish PROMIS Pain Interference item bank, short forms and CAT, is to further improve the interpretability of the PROMIS metric. For example, the bookmarking methodology, adapted from educational testing, could be used to develop cut scores for clinically meaningful category intervals [60]. Other methods should be applied to identify PROMIS Pain Interference score differences that represent minimal important changes [60].
In conclusion, this item calibration study found good cross-cultural and construct validity of the Dutch-Flemish PROMIS Pain Interference item bank. The item bank has the potential to improve the measurement of pain interference. The Dutch-Flemish PROMIS Pain Interference item bank and short forms are now available for clinical application in Dutch speaking persons with chronic pain and a Dutch-Flemish PROMIS Pain Interference CAT can now be developed, for the time being using US PROMIS Pain Interference item parameters.