Abstract
Objective
To examine the test-retest reliability and minimal detectable change (MDC) of the Dutch-Flemish Patient Reported Outcomes Measurement Information System (DF-PROMIS) Pain Interference (PI) v1.1, Physical Function (PF) v1.2, and Upper Extremity (UE) v2.0 computerized adaptive tests (CATs) in patients with musculoskeletal conditions receiving physical therapy in primary care.
Methods
Patients with musculoskeletal conditions of the spine or upper extremity were recruited from fourteen physical therapy practices. Participants completed DF-PROMIS CATs at baseline and again three to fourteen days later. Test-retest reliability was evaluated using the intraclass correlation coefficient (ICC) (two-way random effects, absolute agreement) and minimal detectable change (MDC). Reliability at the participant-level was visually represented by plotting test-retest scores with corresponding 95% confidence intervals (CIs).
Results
Data from 225 patients were analyzed. The DF-PROMIS CATs demonstrated sufficient test-retest reliability, with ICC values ranging from 0.79 to 0.91. MDC values ranged from 4.80 to 6.08 across all measurements. Participant-level reliability was high (0.9–0.95) for most measurements but lower for scores further from the mean. The 95% CIs for test-retest measurements overlapped in 95.3% of measurement pairs.
Conclusion
The DF-PROMIS PF, UE, and PI domain CATs demonstrated sufficient reliability and precision in patients with musculoskeletal conditions receiving physical therapy in primary care practices. Future research should focus on implementing DF-PROMIS CATs in clinical practice, examining their responsiveness, and evaluating their feasibility. Adoption of DF-PROMIS domains as outcomes in intervention studies and clinical practice will enhance interpretability and comparability of results across different patient groups.
Citation: Arensman RM, Haan EJA, Terwee CB, van Rosmalen J, Wittink H, Kiers H (2025) Test-retest reliability and minimal detectable change of Dutch-Flemish Patient Reported Outcomes Measurement Information System (PROMIS®) Computerized Adaptive Tests for musculoskeletal disorders. PLoS One 20(10): e0333670. https://doi.org/10.1371/journal.pone.0333670
Editor: Mark Hwang, The University of Texas Health Science Center at Houston / McGovern Medical School, UNITED STATES OF AMERICA
Received: May 28, 2025; Accepted: September 17, 2025; Published: October 10, 2025
Copyright: © 2025 Arensman et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All files (dataset, scripts, metadata) used for the analysis are available from https://doi.org/10.34894/6ABGM0.
Funding: This study is co-funded by the Taskforce for Applied Research SIA (RAAK.MKB13.025), part of the Dutch Research Council (NWO).
Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: author C.B. Terwee, PhD, is past board member of the PROMIS Health Organization and representative of the Dutch-Flemish PROMIS National Center.
Introduction
Musculoskeletal conditions are the leading contributor to disability worldwide [1,2] and are the highest contributor to the global need for rehabilitation [3]. Rehabilitation can potentially reduce the enormous financial costs related to disability due to musculoskeletal conditions [4] and is often provided by physical therapists. In the Netherlands, physical therapy is provided mainly in primary care physical therapy practices by approximately 21,600 therapists [5]. The most common conditions treated by physical therapists in these practices are musculoskeletal disorders of the spine and upper extremity [6].
The main purpose of physical therapy treatment for patients with musculoskeletal conditions is to provide patient-centered care within a biopsychosocial model tailored to the individual’s personal goals and preferences [7]. To this end, physical therapy treatments aim to improve physical function or to reduce the interference of pain with daily functioning in patients. Patient Reported Outcome Measures (PROMs) help to operationalize these principles by systematically capturing the patient’s own perspective on their symptoms, functioning, and quality of life, thereby ensuring that biological, psychological, and social dimensions of health are included in clinical decision-making [8]. In this way, PROMs support physical therapists in their clinical reasoning related to diagnosis, treatment, and evaluation [8]. A recent systematic review found evidence that feedback from PROMs can improve quality of life, patient-provider communication, and disease control [9]. However, physical therapists find it difficult to select appropriate PROMs for each patient, due to the large number of PROMs available [10], and lack of knowledge regarding the assessment of the quality of available instruments [11]. Despite attempts to support PROM selection by clinicians [12,13], physical therapists indicated the need for a core set of PROMs with clear instructions regarding their application, scoring, and interpretation [11].
In the Netherlands, the Dutch-Flemish translation of the Patient Reported Outcomes Measurement Information System (PROMIS), developed by a US consortium of research groups and funded by the National Institutes of Health (NIH) [14,15], was recommended to standardize measurement of PROMs across different patient groups [16]. PROMIS instruments were selected because they provide precise and efficient assessment of relevant health outcomes, with standardized scores that enable comparability across studies and patient populations [14,17]. Furthermore, PROMIS is a valid and reliable measurement system that was developed using Item Response Theory (IRT) [18]. PROMIS consists of a collection of IRT-based item banks for relevant domains in healthcare, such as physical function (PF), pain interference (PI), and fatigue [15]. From these item banks, Computerized Adaptive Tests (CATs) [19] were developed, which select the most informative questions for each individual based on their responses. This results in patient-specific assessments that require fewer items to complete, while achieving equal or greater measurement precision compared to traditional fixed-item questionnaires [14,20]. PROMIS scores are reported as T-scores, standardized to a reference population with a mean of 50 and a standard deviation of 10 [18]. This allows patients’ outcomes to be interpreted relative to the general population and facilitates comparisons across different patient groups using a common metric.
Several PROMIS item banks relevant for evaluating treatment effects in patients with musculoskeletal conditions receiving physical therapy have been validated in clinical samples [15]. These item banks are the Dutch-Flemish PROMIS Pain Interference (DF-PROMIS-PI) v1.1 item bank (patients with chronic pain) [21], the Dutch-Flemish PROMIS Physical Function (DF-PROMIS-PF) v1.2 item bank (patients with chronic pain or receiving physical therapy) [22,23], and the Dutch-Flemish PROMIS Upper Extremity (DF-PROMIS-UE) v2.0 item bank (patients with UE disorders) [24–26]. All item banks have shown sufficient construct validity and are now available as CATs.
Although validity of the DF-PROMIS CAT instruments has been demonstrated in clinical samples, validity alone is insufficient to justify their routine use by clinicians. For practical implementation in primary care physical therapy, test-retest reliability and minimal detectable change (MDC) are equally important. Information on the test-retest reliability and MDC of the DF-PROMIS CAT instruments is important, as low reliability or large MDCs introduce uncertainty in the scores obtained and, consequently, in the clinical decision-making for which these instruments are used [27]. Test-retest reliability refers to “the extent to which scores for patients who have not changed are the same for repeated measurement … over time” [28], while MDC represents the smallest change detectable beyond measurement error with predefined confidence [29]. Unlike classical test theory (CTT) models that produce single estimates of reliability and MDC dependent on sample characteristics, the Item Response Theory (IRT)-based DF-PROMIS CATs allow estimation of reliability and MDC at the individual patient level [30]. Clinicians thus gain the opportunity to make personalized, reliable decisions regarding patient progress and treatment effectiveness.
Therefore, the aim of the current study was to examine the test-retest reliability and MDC of the DF-PROMIS-PI v1.1, the DF-PROMIS-PF v1.2, and the DF-PROMIS-UE v2.0 in patients with musculoskeletal conditions receiving physical therapy in primary care practices.
Methods
Design and setting
This study was an observational cohort study performed in fourteen primary care physical therapy practices in the Netherlands. The physical therapy practices had to use the electronic health record system Fysiomanager© (Heerenveen, The Netherlands), which has an application programming interface with the Dutch-Flemish PROMIS National Center that enables the administration of PROMIS CATs. All participating physical therapists completed an online training introducing the PROMIS CAT instruments and study protocol and received an instruction manual for reference. The Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were used in the reporting of this work [31].
Ethics statement
The study was reviewed by the Ethical Committee Research Healthcare Domain (ECO-GD) of the HU University of Applied Sciences Utrecht (150-000-2021), which declared that this research project was not subject to the Dutch Medical Research Involving Human Subjects Act and that the protocol was in accordance with the Dutch Personal Data Protection Act. The study was conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from all participants, who provided it by checking a box in the first digital questionnaire in a patient portal of Fysiomanager© prior to or during the first physical therapy consultation.
Participants
Patients were recruited between July 9th, 2021, and April 30th, 2023. Patients were eligible for the study if (1) they contacted the physical therapist seeking treatment for musculoskeletal disorders of the spine or the upper extremity; and (2) they were aged ≥ 18 years. Exclusion criteria were (1) not being able to complete questionnaires; or (2) insufficient command of the Dutch language. Patients with musculoskeletal disorders of the spine or upper extremity who contacted the participating physical therapy practices were informed about the study by the physical therapist. Patients who were interested in participating were sent an information letter about the study by email and, after providing informed consent, completed the first digital questionnaire in a patient portal of Fysiomanager© prior to or during the first physical therapy consultation. Patients who were not able to use the online patient portal could complete the questionnaires and the CATs during their consultations on the computer of the physical therapist. After informed consent was obtained, the physical therapist assessed eligibility for participation and informed the researchers. Due to the observational design of the study, all patients received usual physical therapy treatment.
Sample size
The required sample size was based on the recommendations from the COSMIN-checklist for assessing the methodological quality of studies on measurement properties of PROMs [32]. The aim was to recruit patients until at least 100 patients completed the measurements at both time points for each of the DF-PROMIS CATs.
Measures
Demographic and clinical characteristics, including age, sex, a registration code for the type of disorder, and start and end date of the physical therapy treatment series, were collected from the electronic health records. Duration of the complaint at baseline, type of disorder, educational level, and employment status were self-reported at baseline.
The PROMs used in this study were the DF-PROMIS-PI v1.1, the DF-PROMIS-PF v1.2, the DF-PROMIS-UE v2.0, and an anchor question for each of the PROMIS CAT domains. The DF-PROMIS CATs are based on IRT, and were modeled using a Graded Response Model, a generalization of the 2-parameter logistic model for dichotomous response data [18]. Each DF-PROMIS CAT begins with an item calibrated near the mean of the item bank scale, and the patient’s response is used to estimate a T-score and its corresponding standard error (SE(T-score)). Subsequent most informative items are then selected iteratively based on this estimate, and the process continues until a pre-specified stopping rule is met. In this study, the stopping rules were either an SE(T-score) < 2.24 (corresponding to a reliability of approximately 0.95) with a minimum of four completed items, or a maximum of twelve items. The resulting individual scale scores are expressed as a T-score, calibrated to a mean of 50, with an SD of 10 based on the US general population [14]. This allows the patient’s score to be interpreted relative to the general population, and country-specific reference values are available to assist interpretation [33]. For the items in the item banks underlying the CATs, there are three different 5-point Likert response scales: 1) unable to do/with much difficulty/with some difficulty/with a little difficulty/without any difficulty; 2) cannot do/quite a lot/somewhat/very little/not at all; and 3) cannot do because of health/a lot of difficulty/some difficulty/a little bit of difficulty/no difficulty at all. The DF-PROMIS CAT instruments do not allow missing responses.
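The stopping rules described above can be illustrated with a minimal Python sketch. It is not the PROMIS CAT engine: the function names and the example SE sequences are hypothetical, and only the three thresholds (SE(T-score) < 2.24, at least four items, at most twelve items) come from the text.

```python
from typing import Sequence

# Thresholds taken from the stopping rules described in the text.
SE_THRESHOLD = 2.24   # stop once SE(T-score) drops below this value
MIN_ITEMS = 4         # minimum number of completed items
MAX_ITEMS = 12        # maximum number of administered items


def should_stop(se_t_score: float, n_items: int) -> bool:
    """Return True when the CAT should stop after n_items completed items."""
    if n_items >= MAX_ITEMS:
        return True  # item limit reached, regardless of precision
    return n_items >= MIN_ITEMS and se_t_score < SE_THRESHOLD


def items_administered(se_after_each_item: Sequence[float]) -> int:
    """Count the items given for a hypothetical sequence of SE values,
    where element i is the SE(T-score) after item i + 1."""
    for n_items, se in enumerate(se_after_each_item, start=1):
        if should_stop(se, n_items):
            return n_items
    return len(se_after_each_item)
```

In this sketch, a participant whose SE(T-score) is already below 2.24 after three items still receives a fourth item, and a participant who never reaches the precision threshold stops at twelve items.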
The DF-PROMIS-PI v1.1 underlying item bank has shown good cross-cultural and construct validity [21]. The item bank contains 40 items covering a wide range of pain interference content. The time frame the patient is asked about is the past 7 days. Higher scores indicate more pain interference.
The DF-PROMIS-PF v1.2 item bank was validated in a sample of Dutch patients with chronic pain [23] and in a Dutch sample of patients receiving physical therapy [22] and showed sufficient to good psychometric properties. The item bank contains 121 items, which cover a wide range of activities, from self-care (activities of daily living) to more complex activities that require a combination of skills. The item bank includes items about functioning of the axial regions (neck and back), the upper and lower extremities, and ability to carry out instrumental activities of daily living (e.g., household chores or shopping). There is no time frame set for the items, but current status is inferred. Higher scores indicate better function.
The DF-PROMIS-UE v2.0 item bank was validated in Dutch samples of patients with upper extremity disorders, and showed sufficient psychometric properties [24–26]. The DF-PROMIS-UE v2.0 item bank contains 46 items addressing upper extremity function. Higher scores indicate better function.
The DF-PROMIS CATs were completed at baseline and a second time between three and fourteen days after the baseline measurement. This timeframe was chosen to be “long enough to prevent recall, and short enough to ensure that patients remain stable” [34], in a population of patients whose health status can change quickly for acute complaints even without treatment [35,36]. To ensure stability in the domains of interest, the patient’s perceived change was assessed using an anchor question.
The anchor questions related to the DF-PROMIS CAT domains were completed at the second measurement of the DF-PROMIS CATs. The anchor question for each of the DF-PROMIS CAT domains is a single item asking “To what extent do you think your pain interference/physical function/upper extremity function has changed since the start of physical therapy treatment?” with seven response options: 1) very much improved; 2) much improved; 3) somewhat improved; 4) unchanged; 5) somewhat deteriorated; 6) much deteriorated; 7) very much deteriorated.
Each patient only completed DF-PROMIS CATs and corresponding anchor questions relevant to their musculoskeletal disorder. Patients with back or neck pain completed the DF-PROMIS-PF v1.2 CAT, patients with UE disorders completed the DF-PROMIS-UE v2.0 CAT, and all patients completed the DF-PROMIS-PI v1.1 CAT.
Data analysis
Data were analyzed using R Statistical Software (v4.3.2; R Core Team 2023) and the ‘psych’ package [37]. Descriptive statistics were used to explore patient characteristics. Mean and SD are reported for normally distributed data, median and interquartile range (IQR) for non-normally distributed data.
For the assessment of test-retest reliability and the corresponding MDC, patient data were included in the analysis when the retest measurement was completed between three and fourteen days after the baseline measurement and when the patient answered “unchanged” on the domain-specific anchor question.
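The inclusion rule for the test-retest analysis can be written as a small filter. The sketch below is in Python rather than the R used by the authors, and the function and constant names are hypothetical. It assumes that the three- and fourteen-day boundaries are inclusive, which the text does not state explicitly.

```python
from datetime import date

MIN_DAYS = 3                  # shortest accepted retest interval (assumed inclusive)
MAX_DAYS = 14                 # longest accepted retest interval (assumed inclusive)
STABLE_ANSWER = "unchanged"   # anchor response that marks a stable patient


def eligible_for_retest_analysis(baseline: date, retest: date, anchor_answer: str) -> bool:
    """Return True when a test-retest pair meets both inclusion conditions."""
    interval_days = (retest - baseline).days
    within_window = MIN_DAYS <= interval_days <= MAX_DAYS
    stable = anchor_answer.strip().lower() == STABLE_ANSWER
    return within_window and stable
```

A pair measured on 1 March and 8 March with the anchor answer “unchanged” would be kept, whereas the same pair with the answer “somewhat improved” would be dropped.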
First, CTT-based methods were used to estimate test-retest reliability and the corresponding MDC of the DF-PROMIS CATs. To this end, the intraclass correlation coefficient (ICC) was calculated using the two-way random effects model for absolute agreement [38]:
$$\mathrm{ICC}_{agreement} = \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2_{m} + \sigma^2_{error}}$$
with $\sigma^2_{p}$ being the variation between participants, $\sigma^2_{m}$ being the variation between repeated measurements, and error variance $\sigma^2_{error}$. The two-way random effects absolute agreement ICC and the corresponding 95% confidence intervals (95%CIs) were calculated for each DF-PROMIS CAT separately. An ICC value of ≥ 0.7 is rated “sufficient” [39,40]. The MDC was also calculated for each of the DF-PROMIS CATs and was based on IRT. Since the SE(T-score) varies between participants, the MDC was first calculated for the individual using the formula:
$$\mathrm{MDC} = 1.96 \times \sqrt{2} \times \mathrm{SE(T\text{-}score)}$$
The MDC for the group was then obtained by calculating the mean of the individual MDCs for each of the DF-PROMIS CATs.
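As an illustration of the two group-level quantities, the following Python sketch computes a two-way random effects, absolute agreement, single-measurement ICC from the usual ANOVA mean squares, and an MDC from the SE(T-score). It is not the authors' script (they used R and the ‘psych’ package). The function names are hypothetical, and the MDC function assumes the conventional form MDC = 1.96 × √2 × SE(T-score).

```python
import math
from typing import Sequence


def icc_agreement(scores: Sequence[Sequence[float]]) -> float:
    """ICC, two-way random effects, absolute agreement, single measurement.

    `scores` has one row per participant and one column per measurement
    occasion (for example test and retest).
    """
    n = len(scores)       # number of participants
    k = len(scores[0])    # number of repeated measurements
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for participants, measurements, and residual error.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def mdc95(se_t_score: float) -> float:
    """MDC with 95% confidence for one participant (assumed conventional form)."""
    return 1.96 * math.sqrt(2) * se_t_score


def group_mdc95(se_t_scores: Sequence[float]) -> float:
    """Group-level MDC as the mean of the individual MDCs."""
    return sum(mdc95(se) for se in se_t_scores) / len(se_t_scores)
```

Because the MDC is linear in the SE(T-score), the mean of the individual MDCs equals the MDC of the mean SE(T-score). Under the assumed formula, an SE(T-score) of 2.06 gives an MDC of about 5.71 T-score points.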
Although IRT models provide information on reliability in terms of the precision of the T-score (i.e., the standard error of the T-score), IRT does not provide a default method for assessing test-retest reliability. Instead, the DF-PROMIS CAT results were used to calculate participant-level reliability and MDC for the first measurement of each of the DF-PROMIS CAT instruments. In addition, the T-scores for individual participants with corresponding 95%CI at both time points were presented graphically to provide information on test-retest reliability.
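The graphical comparison of test and retest T-scores rests on whether the two 95% CIs of a participant overlap. A minimal Python sketch of that check follows. It assumes a normal-approximation interval of T-score ± 1.96 × SE(T-score), which the text does not spell out, and all names are hypothetical.

```python
from typing import Tuple

Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumption)


def ci95(t_score: float, se_t_score: float) -> Tuple[float, float]:
    """Return the lower and upper bound of the 95% CI around a T-score."""
    half_width = Z_95 * se_t_score
    return (t_score - half_width, t_score + half_width)


def intervals_overlap(test: Tuple[float, float], retest: Tuple[float, float]) -> bool:
    """Check whether the 95% CIs of a test and a retest measurement overlap.

    Each argument is a (T-score, SE(T-score)) pair.
    """
    low_test, high_test = ci95(*test)
    low_retest, high_retest = ci95(*retest)
    return low_test <= high_retest and low_retest <= high_test
```

For example, two measurements of 50 and 55 with an SE(T-score) of 2.0 each have overlapping intervals, whereas 50 and 58 with the same SE do not.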
For participant-level reliability, the estimate of the T-score () for participant p from one of the DF-PROMIS CAT instruments is represented by
with variance
.
represents the squared SE(T-score), based on the IRT model. Reliability at the participant level can then be defined and calculated as [30]:
Because the goal is to interpret a participant’s T-score relative to the reference population, the variance of the T-score of the reference population () can be used in the calculation instead of an estimate of participant variance
. For the DF-PROMIS CAT instruments, the following variances from the Dutch population were used: DF-PROMIS-PF
, DF-PROMIS-UE
, and DF-PROMIS-PI
[33].
Finally, the MDC with 95% confidence at the participant level (pMDC) was calculated for the first measurement using [29]:
$$\mathrm{pMDC} = 1.96 \times \sqrt{2} \times \mathrm{SE(T\text{-}score)}$$
Results
Participants
A total of 1141 patients seeking treatment for musculoskeletal disorders of the spine or the upper extremity contacted the fourteen participating physical therapy practices during the study and were asked to participate. Of these patients, 67.8% (774) agreed to participate, of whom 14 were excluded based on the exclusion criteria. Data were available for test-retest reliability analysis from 225 of the included patients: 106 completed the DF-PROMIS-PF v1.2, 93 completed the DF-PROMIS-UE v2.0, and 181 completed the DF-PROMIS-PI v1.1 (Fig 1).
The patients included in the study were mostly female (67.6%) and mean age was 47.4 (SD 15.8). Almost half of the included patients (47.3%) experienced chronic complaints. Additional demographic characteristics of the participants are shown in Table 1.
Test-retest reliability
The group-level measures of reliability of the DF-PROMIS CATs are shown in Table 2. The DF-PROMIS CATs showed sufficient group-level test-retest reliability (ICC agreement between 0.79 and 0.91). Median SE(T-score)s of the DF-PROMIS CATs ranged from 1.8 (PI domain) to 2.1 (UE domain), resulting in group-level MDC scores ranging from 4.80 to 6.08 across the respective domains.
Reliability at the participant level is plotted against the corresponding T-score for the DF-PROMIS CAT domains in Fig 2, which shows how participant-level reliability varies across the observed T-scores. For observed scale scores more than approximately 0.5 SD above the mean (or more than approximately 0.5 SD below the mean in the case of the PI domain), reliability was below the 0.9 threshold. Similarly, Fig 3 depicts the participant-level MDC plotted against the observed T-scores, indicating higher MDC values for observed T-scores above the mean for the UE domain and higher MDC values for observed T-scores below the mean for the PI domain. Fig 4 displays the SE(T-score) plotted against the observed T-score, providing a clear visual representation of the impact of the stopping rules employed in the DF-PROMIS CATs. The data points on the horizontal “ceiling line” represent the stopping rule of reaching an SE(T-score) of 2.24 at or after completion of the minimum required four items, but before the twelve-item limit was reached. Data points below this “ceiling line” represent participants who achieved an SE(T-score) of 2.24 or less before completing the required minimum of four items, necessitating the completion of additional items to meet the minimum required number, which reduced the SE(T-score) further. Data points above the 2.24 “ceiling line” represent participants whose assessments were stopped because the twelve-item limit was reached. A visual representation of test-retest reliability across the scale score range of the DF-PROMIS CAT instruments can be seen in Fig 5. The figure shows that the 95%CIs of the test and retest measurements did not overlap for four, three, and eleven participants for the DF-PROMIS-PF CAT, DF-PROMIS-UE CAT, and DF-PROMIS-PI CAT, respectively, which amounts to 4.7% of all observed test-retest measurements.
An SE(T-score) of 3.16 or lower corresponds to a reliability of 0.9 or higher.
Discussion
This study examined the reliability and MDC of the DF-PROMIS-PI v1.1, the DF-PROMIS-PF v1.2, and the DF-PROMIS-UE v2.0 in patients with musculoskeletal conditions receiving physical therapy in primary care practices. The results demonstrate that the test-retest reliability of all the DF-PROMIS CATs investigated exceeds the threshold for “sufficient” [39,40].
The reliability at the participant level exceeded 0.9 for all but five of the DF-PROMIS CAT measurements, suggesting that the DF-PROMIS CATs are sufficiently precise for clinical decision-making regarding individual patients [41]. However, participant-level reliability for the UE and PI domains may fall below the 0.9 threshold as individuals’ T-scores deviate further from the population mean (increasing above it for the UE domain (high level of functioning) or decreasing below it for the PI domain (low level of PI)). In practical terms, reduced precision in these instances typically does not present a problem. Likely outcomes in such scenarios include concluding that no further treatment is necessary to improve the DF-PROMIS CAT domain or that existing treatment has been effective and can be concluded. The high precision of the DF-PROMIS CATs is also reflected in the participant level MDCs (Fig 3) and group level MDCs ranging from 4.80 to 6.08.
The observed test-retest reliability outcomes at the group level align with findings from previous research investigating DF-PROMIS CATs. The DF-PROMIS-PF v1.2 CAT demonstrated sufficient test-retest reliability (ICC 0.92), mean SE(T-score) (2.06), and MDC (5.72) in Dutch patients with chronic kidney disease [42], which are comparable to those in the current study. Although no data were presented on participant-level reliability, the authors utilized the same formula to calculate the IRT-based group-level MDC. Given the mean SE(T-score) provided, it can be inferred that participant-level reliability is likely very similar to the findings of this study. Comparable results have also been reported in Dutch patients with chronic pain [23] and those receiving physical therapy [22], where similarly low SE(T-score)s (≤ 2.0) were observed.
The DF-PROMIS-UE v2.0 CAT was previously evaluated in Dutch patients with UE disorders, where slightly higher SE(T-scores) were observed than those in the present study [26]. The patients in the current study had a higher mean T-score of 37.2 compared to the earlier cohort (range 33.4–34.7), which might account for this difference. However, the difference is relatively minor and is unlikely to influence clinical decisions at the individual level. Wilkinson et al. (2021) reported an ICC of 0.82 with an 83% confidence interval (CI) ranging from 0.77 to 0.86, slightly lower than the ICC of 0.91 with a 95% CI ranging from 0.86 to 0.94 found in the current study [43]. Despite the different thresholds used for calculating the CIs, their similarity further supports consistency in the ICCs reported.
Participant-level reliability of the DF-PROMIS-PI v1.1 CAT has been shown to be > 0.95 in Dutch patients with chronic pain [21], Dutch patients with rheumatoid arthritis [44], and Dutch patients with chronic kidney disease [42] with T-scores ranging from 50 to 80. These observations are almost identical to the findings in the current study. Mean T-scores are also very similar between the different patient groups investigated, with the exception of the Dutch patients with chronic pain, who had a mean T-score of 64.1 [21].
Collectively, the DF-PROMIS PF, UE, and PI domain CATs consistently demonstrate sufficient reliability and precision for measuring their respective domains in both clinical practice and research settings across different patient groups [45]. The consistency of precision across different populations underscores a significant advantage of IRT-based PROMs over CTT-based PROMs, namely that the precision of T-scores is an inherent characteristic of the instrument, independent of the sample.
The DF-PROMIS PF, UE, and PI domain CATs also show consistent MDCs at both the group and participant level across patient populations, as shown by the similar SE(T-scores). This similarity is unsurprising, since the SE(T-score) depends on the location of the T-score across the scale of the CAT, with higher SE(T-scores) at the extremes and the lowest SE(T-scores) around the mean of the scale [46]. Therefore, patient groups with similar average ability (T-score) in a DF-PROMIS domain have similar mean SE(T-score)s and consequently similar MDC values.
The test-retest reliability of the DF-PROMIS CATs examined in this study, assessed using CTT-based methods, also demonstrates sufficient reliability across all three domains, with ICC scores ranging from 0.79 to 0.91. These findings align with reliability results from domain-specific PROMs. For example, the Quebec Back Pain Disability Scale [47] (QBPDS) as a measure of physical function in patients with lower back pain (LBP) was found to have sufficient reliability, ranging from 0.70 to 0.99 [48]. Similarly, the Neck Disability Index [49] (NDI), a measure of physical function in patients with neck pain, showed a pooled ICC of 0.91 [50]. Although the latter is higher, it is similar to the ICC of 0.86 found in this study for physical function. Test-retest reliability based on CTT methods for the DF-PROMIS UE domain CAT can be compared with the findings from the Disabilities of the Arm, Shoulder, and Hand [51] (DASH) questionnaire. Reliability of the DASH has been shown to range from 0.91 to 0.96 in patients with shoulder disorders [52]. These findings are again very similar to the ICC of 0.91 for the DF-PROMIS-UE v2.0 CAT found in the current study.
For clinical care, the observed reliability and low MDCs indicate that clinicians can trust PROMIS CAT scores to be precise and reproducible across diverse musculoskeletal populations. This supports their use for monitoring outcomes and guiding treatment decisions at both the individual and group level. Because DF-PROMIS CATs adaptively administer only the most informative items, they reduce patient burden while maintaining measurement precision, making them feasible for integration into routine primary care physical therapy. Furthermore, the standardized T-score metric facilitates comparisons across conditions and patient groups, allowing clinicians to interpret outcomes consistently in heterogeneous patient populations. Together, these properties strengthen the case for wider implementation of PROMIS CATs in physical therapy where efficient, low-burden, and clinically meaningful outcome measurement is needed [53].
Strengths and limitations
Several strengths and limitations must be considered when interpreting the results of this study. A notable strength is the large sample size for each of the three DF-PROMIS domains, judged against the COSMIN recommendation of at least 100 patients [32]. Meeting this recommendation for the PF and PI domains enhances confidence in the results and strengthens the methodological quality of the study. Although the threshold of 100 patients was not reached for the UE domain (93 patients), the impact on the generalizability of the results is likely minor. Additionally, the study’s design minimized the effort required from physical therapists for patient recruitment and data collection and, combined with broad eligibility criteria, ensured that the included patients closely reflect the average patient treated by physical therapists. Furthermore, the inclusion of an anchor question asking patients to report changes on the DF-PROMIS domains ensured that only data from patients without changes were included in the analysis.
A limitation to consider is the large number of patients excluded from the analysis due to reporting change on the anchor questions, exceeding the maximum number of days between the two measurements, or missing data for the baseline or follow-up measurement (535 patients). This exclusion necessitated the recruitment of additional patients to meet the sample size requirements, leading to a waste of time and resources, and might have introduced bias. Another potential source of bias is the significant number of patients who declined to participate. To keep data collection feasible and minimize the burden on participants, reasons for non-participation were not recorded, leaving it unknown if and how much bias was introduced. The final limitation is that CATs require an electronic device, such as a smartphone, tablet, or personal computer, for completion. This requirement might exclude patients lacking digital skills from completing the DF-PROMIS CATs. To mitigate this issue, patients could complete the measurements during a treatment session with their physical therapist, although this could potentially lead to biased responses. Unfortunately, data collection did not include whether the patient completed the measurements alone or with support from their physical therapist.
A final important consideration is that the DF-PROMIS-PI uses a 7-day recall period, whereas the DF-PROMIS-PF and DF-PROMIS-UE capture current function. For the DF-PROMIS-PI, a 3-day retest interval may lead to partial overlap of recall periods, potentially inflating agreement. In contrast, DF-PROMIS-PF and DF-PROMIS-UE scores reflect the patient’s status at the time of measurement, making their reliability more directly dependent on short-term stability. Clinically, this implies that when follow-up measurements with the DF-PROMIS-PI are planned within seven days, the recall period would need to be adjusted to prevent overlap; however, any modification of the recall period requires prior approval from HealthMeasures or the DF-PROMIS National Center.
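The overlap between recall periods can be made concrete with a small Python sketch. It is a simplification that assumes each recall window covers the seven days ending on the day the CAT is completed; the function name is hypothetical.

```python
RECALL_DAYS = 7  # recall period of the DF-PROMIS-PI items, in days


def recall_overlap_days(retest_interval_days: int, recall_days: int = RECALL_DAYS) -> int:
    """Number of days shared by the recall windows of test and retest."""
    return max(0, recall_days - retest_interval_days)
```

Under this assumption, a retest after three days shares four days of recall with the baseline measurement, while a retest after seven or more days shares none.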
Compared with legacy instruments, the DF-PROMIS CATs require fewer items and are more accurate. They also allow comparison between different patient groups relative to the reference population on a domain, and they adapt to each individual patient based on that patient's responses to the items in the DF-PROMIS CAT. Future research should focus on implementing the DF-PROMIS CATs in clinical practice by investigating their responsiveness and clinical feasibility. Researchers should consider adopting the DF-PROMIS domains as outcomes for intervention studies to facilitate interpretability and comparison of results between different patient groups.
Conclusion
The DF-PROMIS PF, UE, and PI CATs demonstrated sufficient reliability and precision at both the group-level and participant-level in patients with musculoskeletal conditions receiving physical therapy in primary care practices. Future research should focus on implementing DF-PROMIS CATs in clinical practice, examining their responsiveness, and evaluating their feasibility. Adoption of DF-PROMIS domains as outcomes in intervention studies and clinical practice will enhance the interpretability and comparability of results across different patient groups, thereby potentially improving patient care and outcomes in physical therapy.
References
- 1. GBD 2019 Diseases and Injuries Collaborators. Global burden of 369 diseases and injuries in 204 countries and territories, 1990-2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396(10258):1204–22. pmid:33069326
- 2. Liu S, Wang B, Fan S, Wang Y, Zhan Y, Ye D. Global burden of musculoskeletal disorders and attributable factors in 204 countries and territories: a secondary analysis of the Global Burden of Disease 2019 study. BMJ Open. 2022;12(6):e062183. pmid:35768100
- 3. Cieza A, Causey K, Kamenov K, Hanson SW, Chatterji S, Vos T. Global estimates of the need for rehabilitation based on the Global Burden of Disease study 2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2021;396(10267):2006–17. pmid:33275908
- 4. United States Bone and Joint Initiative. The burden of musculoskeletal diseases in the United States: Prevalence, Societal and Economic Cost. 4th ed. Rosemont, IL: United States Bone and Joint Initiative. 2018. Available from: www.boneandjointburden.org
- 5. Medisch geschoolden; specialisme, arbeidspositie, sector, leeftijd. In: CBS [Internet]. 2024 [cited 2024 Mar 4]. Available from: https://opendata.cbs.nl/statline/#/CBS/nl/dataset/84776NED/table?dl=8FB5C
- 6. Veldkamp R, Kruisselbrink M, Meijer W. Zorgregistraties Eerste Lijn - Zorg door de fysiotherapeut; jaarcijfers 2020 en trendcijfers 2017 - 2020. Utrecht: Nivel; 2022. Available from: https://www.nivel.nl/nl/nivel-zorgregistraties-eerste-lijn/nivel-zorgregistraties-eerste-lijn
- 7. Lin I, Wiles L, Waller R, Goucke R, Nagree Y, Gibberd M, et al. What does best practice care for musculoskeletal pain look like? Eleven consistent recommendations from high-quality clinical practice guidelines: systematic review. Br J Sports Med. 2020;54(2):79–86. pmid:30826805
- 8. Kyte DG, Calvert M, van der Wees PJ, ten Hove R, Tolan S, Hill JC. An introduction to patient-reported outcome measures (PROMs) in physiotherapy. Physiotherapy. 2015;101(2):119–25. pmid:25620440
- 9. Gibbons C, Porter I, Gonçalves-Bradley DC, Stoilov S, Ricci-Cabello I, Tsangaris E, et al. Routine provision of feedback from patient-reported outcome measurements to healthcare providers and patients in clinical practice. Cochrane Database Syst Rev. 2021;10(10):CD011589. pmid:34637526
- 10. Fennelly O, Blake C, Desmeules F, Stokes D, Cunningham C. Patient-reported outcome measures in advanced musculoskeletal physiotherapy practice: a systematic review. Musculoskeletal Care. 2018;16(1):188–208. pmid:28660673
- 11. Swinkels RAHM, van Peppen RPS, Wittink H, Custers JWH, Beurskens AJHM. Current use and barriers and facilitators for implementation of standardised measures in physical therapy in the Netherlands. BMC Musculoskelet Disord. 2011;12:106. pmid:21600045
- 12. Fleischmann M, Vaughan B. The challenges and opportunities of using patient reported outcome measures (PROMs) in clinical practice. Int J Osteopath Med. 2018;28:56–61.
- 13. Chiarotto A. Patient-reported outcome measures: best is the enemy of good (But What if Good Is Not Good Enough?). J Orthop Sports Phys Ther. 2019;49(2):39–42. pmid:30704358
- 14. Cella D, Riley W, Stone A, Rothrock N, Reeve B, Yount S, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005-2008. J Clin Epidemiol. 2010;63(11):1179–94. pmid:20685078
- 15. Terwee CB, Roorda LD, de Vet HCW, Dekker J, Westhovens R, van Leeuwen J, et al. Dutch-Flemish translation of 17 item banks from the Patient-Reported Outcomes Measurement Information System (PROMIS). Qual Life Res. 2014;23(6):1733–41. pmid:24402179
- 16. Oude Voshaar M, Terwee CB, Haverman L, van der Kolk B, Harkes M, van Woerden CS, et al. Development of a standard set of PROs and generic PROMs for Dutch medical specialist care: recommendations from the Outcome-Based Healthcare Program Working Group Generic PROMs. Qual Life Res. 2023;32(6):1595–605. pmid:36757571
- 17. HealthMeasures. Intro to PROMIS®. 2025 [cited 23 Aug 2025]. Available from: https://www.healthmeasures.net/explore-measurement-systems/promis/intro-to-promis
- 18. Reeve BB, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007;45(5 Suppl 1):S22-31. pmid:17443115
- 19. Meijer RR, Nering ML. Computerized adaptive testing: overview and introduction. Appl Psychol Meas. 1999;23(3):187–94.
- 20. HealthMeasures. PROMIS® (Patient-Reported Outcomes Measurement Information System). 27 Mar 2023 [cited 2024 Jan 4]. Available from: https://www.healthmeasures.net/explore-measurement-systems/promis
- 21. Crins MHP, Roorda LD, Smits N, de Vet HCW, Westhovens R, Cella D, et al. Calibration and validation of the Dutch-Flemish PROMIS pain interference item bank in patients with chronic pain. PLoS One. 2015;10(7):e0134094. pmid:26214178
- 22. Crins MHP, van der Wees PJ, Klausch T, van Dulmen SA, Roorda LD, Terwee CB. Psychometric properties of the PROMIS Physical Function item bank in patients receiving physical therapy. PLoS One. 2018;13(2):e0192187. pmid:29432433
- 23. Crins MHP, Terwee CB, Klausch T, Smits N, de Vet HCW, Westhovens R, et al. The Dutch-Flemish PROMIS Physical Function item bank exhibited strong psychometric properties in patients with chronic pain. J Clin Epidemiol. 2017;87:47–58. pmid:28363734
- 24. van Bruggen SGJ, Lameijer CM, Terwee CB. Structural validity and construct validity of the Dutch-Flemish PROMIS® physical function-upper extremity version 2.0 item bank in Dutch patients with upper extremity injuries. Disabil Rehabil. 2021;43(8):1176–84. pmid:31411908
- 25. Haan E-JA, Terwee CB, Van Wier MF, Willigenburg NW, Van Deurzen DFP, Pisters MF, et al. Translation, cross-cultural and construct validity of the Dutch-Flemish PROMIS® upper extremity item bank v2.0. Qual Life Res. 2020;29(4):1123–35. pmid:31894506
- 26. Lameijer CM, van Bruggen SGJ, Haan EJA, Van Deurzen DFP, Van der Elst K, Stouten V, et al. Graded response model fit, measurement invariance and (comparative) precision of the Dutch-Flemish PROMIS® Upper Extremity V2.0 item bank in patients with upper extremity disorders. BMC Musculoskelet Disord. 2020;21(1):170. pmid:32178644
- 27. Mokkink LB, Eekhout I, Boers M, van der Vleuten CPM, de Vet HCW. Studies on reliability and measurement error of measurements in medicine - From design to statistics explained for medical researchers. Patient Relat Outcome Meas. 2023;14:193–212. pmid:37448975
- 28. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63(7):737–45. pmid:20494804
- 29. de Vet HCW, Terwee CB, Mokkink LB, Knol DL. Measurement in medicine: a practical guide. Cambridge University Press. 2011.
- 30. Raju NS, Price LR, Oshima TC, Nering ML. Standardized conditional SEM: a case for conditional reliability. Appl Psychol Meas. 2007;31(3):169–80.
- 31. Kottner J, Audige L, Brorson S, Donner A, Gajewski BJ, Hróbjartsson A, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Int J Nurs Stud. 2011;48(6):661–71. pmid:21514934
- 32. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539–49. pmid:20169472
- 33. Terwee CB, Roorda LD. Country-specific reference values for PROMIS® pain, physical function and participation measures compared to US reference values. Ann Med. 2023;55(1):1–11. pmid:36426680
- 34. Mokkink LB, Prinsen CA, Patrick DL, Alonso J, de Vet HC, Terwee CB. COSMIN Study Design checklist for Patient-reported outcome measurement instruments. Amsterdam; 2019 Jul. Available from: https://www.cosmin.nl/wp-content/uploads/COSMIN-study-designing-checklist_final.pdf
- 35. Vasseljen O, Woodhouse A, Bjørngaard JH, Leivseth L. Natural course of acute neck and low back pain in the general population: the HUNT study. Pain. 2013;154(8):1237–44. pmid:23664654
- 36. Wallwork SB, Braithwaite FA, O’Keeffe M, Travers MJ, Summers SJ, Lange B, et al. The clinical course of acute, subacute and persistent low back pain: a systematic review and meta-analysis. CMAJ. 2024;196(2):E29–46. pmid:38253366
- 37. Revelle W. Psych: procedures for psychological, psychometric, and personality research. Evanston, Illinois: Northwestern University. 2024.
- 38. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1(1):30–46.
- 39. Prinsen CAC, Vohra S, Rose MR, Boers M, Tugwell P, Clarke M, et al. How to select outcome measurement instruments for outcomes included in a “Core Outcome Set” - a practical guideline. Trials. 2016;17(1):449. pmid:27618914
- 40. de Vet HCW, Terwee CB, Mokkink LB, Knol DL. Reliability. In: de Vet HCW, Terwee CB, Mokkink LB, Knol DL, editors. Measurement in medicine: a practical guide. Cambridge: Cambridge University Press; 2011. p. 96–149.
- 41. Hahn EA, Cella D, Chassany O, Fairclough DL, Wong GY, Hays RD, et al. Precision of health-related quality-of-life data compared with other clinical measures. Mayo Clin Proc. 2007;82(10):1244–54. pmid:17908530
- 42. van der Willik EM, van Breda F, van Jaarsveld BC, van de Putte M, Jetten IW, Dekker FW, et al. Validity and reliability of the Patient-Reported Outcomes Measurement Information System (PROMIS®) using computerized adaptive testing in patients with advanced chronic kidney disease. Nephrol Dial Transplant. 2023;38(5):1158–69. pmid:35913734
- 43. Wilkinson JT, Clawson JW, Allen CM, Presson AP, Tyser AR, Kazmers NH. Reliability of telephone acquisition of the PROMIS upper extremity computer adaptive test. J Hand Surg Am. 2021;46(3):187–99. pmid:33243590
- 44. Crins MHP, Terwee CB, Westhovens R, van Schaardenburg D, Smits N, Joly J, et al. First validation of the full PROMIS pain interference and pain behavior item banks in patients with rheumatoid arthritis. Arthritis Care Res (Hoboken). 2020;72(11):1550–9. pmid:31562795
- 45. Embretson SE, Reise SP. Item response theory for psychologists. Mahwah, NJ, US: Lawrence Erlbaum Associates Publishers; 2000.
- 46. Jabrayilov R, Emons WHM, Sijtsma K. Comparison of classical test theory and item response theory in individual change assessment. Appl Psychol Meas. 2016;40(8):559–72. pmid:29881070
- 47. Kopec JA, Esdaile JM, Abrahamowicz M, Abenhaim L, Wood-Dauphinee S, Lamping DL, et al. The Quebec Back Pain Disability Scale: conceptualization and development. J Clin Epidemiol. 1996;49(2):151–61. pmid:8606316
- 48. Speksnijder CM, Koppenaal T, Knottnerus JA, Spigt M, Staal JB, Terwee CB. Measurement properties of the Quebec Back Pain Disability Scale in patients with nonspecific low back pain: systematic review. Phys Ther. 2016;96(11):1816–31. pmid:27231271
- 49. Vernon H, Mior S. The Neck Disability Index: a study of reliability and validity. J Manipulative Physiol Ther. 1991;14(7):409–15. pmid:1834753
- 50. Saltychev M, Pylkäs K, Karklins A, Juhola J. Psychometric properties of neck disability index - a systematic review and meta-analysis. Disabil Rehabil. 2024;46(23):5415–31. pmid:38240027
- 51. Hudak PL, Amadio PC, Bombardier C, Beaton D, Cole D, Davis A, et al. Development of an upper extremity outcome measure: The DASH (disabilities of the arm, shoulder, and hand). Am J Ind Med. 1996;29(6):602–8.
- 52. Kolber MJ, Salamh PA, Hanney WJ, Samuel Cheng M. Clinimetric evaluation of the disabilities of the arm, shoulder, and hand (DASH) and QuickDASH questionnaires for patients with shoulder disorders. Phys Ther Rev. 2013;19(3):163–73.
- 53. Terwee CB, Ahmed S, Alhasani R, Alonso J, Bartlett SJ, Chaplin JE, et al. Comparable real-world patient-reported outcomes data across health conditions, settings, and countries: The PROMIS International Collaboration. NEJM Catalyst. 2024;5(9).