Further Optimization of the Reliability of the 28-Joint Disease Activity Score in Patients with Early Rheumatoid Arthritis

Background The 28-joint Disease Activity Score (DAS28) combines scores on a 28-tender and swollen joint count (TJC28 and SJC28), a patient-reported measure for general health (GH), and an inflammatory marker (either the erythrocyte sedimentation rate [ESR] or the C-reactive protein [CRP]) into a composite measure of disease activity in rheumatoid arthritis (RA). This study examined the reliability of the DAS28 in patients with early RA using principles from generalizability theory and evaluated whether it could be increased by adjusting individual DAS28 component weights. Methods Patients were drawn from the DREAM registry and classified into a “fast response” group (N = 466) and “slow response” group (N = 80), depending on their pace of reaching remission. Composite reliabilities of the DAS28-ESR and DAS28-CRP were determined with the individual components' reliability, weights, variances, error variances, correlations and covariances. Weight optimization was performed by minimizing the error variance of the index. Results Composite reliabilities of 0.85 and 0.86 were found for the DAS28-ESR and DAS28-CRP, respectively, and were approximately equal across patients groups. Component reliabilities, however, varied widely both within and between sub-groups, ranging from 0.614 for GH (“slow response” group) to 0.912 for ESR (“fast response” group). Weight optimization increased composite reliability even further. In the total and “fast response” groups, this was achieved mostly by decreasing the weight of the TJC28 and GH. In the “slow response” group, though, the weights of the TJC28 and SJC28 were increased, while those of the inflammatory markers and GH were substantially decreased. Conclusions The DAS28-ESR and the DAS28-CRP are reliable instruments for assessing disease activity in early RA and reliability can be increased even further by adjusting component weights. Given the low reliability and weightings of the general health component across subgroups it is recommended to explore alternative patient-reported outcome measures for inclusion in the DAS28.


Introduction
If a concept or condition is too complex to measure with a single instrument, multiple measurements are often combined into a linear composite score (i.e. an index measure). For rheumatoid arthritis (RA) the 28-joint Disease Activity Score (DAS28) is such an index measure, widely used for determining a patient's degree of disease activity [1]. It consists of 4 different individual components: a 28-tender joint count, a 28-swollen joint count, a patient-reported rating of general health, and a non-specific acute phase reactant of systemic inflammation which can be either the erythrocyte sedimentation rate (ESR) or the C-reactive protein (CRP). Each component has its own specific weight in the composite score, based on canonical discriminant functions for classifying high and low disease activity.
The DAS28 has received much attention over the years and has been shown to be a valid measure [1,2]. However, since reliability is a prerequisite for validity [3], the index should be reliable as well. While several studies have already determined the reliability of the DAS28 using Cronbach's Alpha [2,4,5], it is not appropriate to use this internal consistency measure with an index measure, as opposed to scales. Where a scale consists of correlated items which all measure the same construct, an index consists of items which are not necessarily highly correlated but which are considered indicators because they themselves define the construct [6]. As a result, the components might measure completely different aspects of disease activity, which is also the case with the DAS28. This poses significant methodological challenges for reliability testing. To overcome these challenges, generalizability theory [7,8] can be used to estimate the reliability of an index score by disentangling different sources of error.
As such, the first aim of this study was to determine the reliability of the DAS28 using generalizability theory. Since reliability is a concept defined relative to a specific population of patients [9], the second aim of this study was to examine whether the reliability of the index is acceptably high in relevant subpopulations and whether this reliability can be increased by adjusting the weightings of the individual component scores within the DAS28, which was shown to be the case.

Ethics Statement
As evaluated by the ethics committees of the participating hospitals, and in accordance with Dutch law, no ethical approval was required because data collection took place in daily clinical practice. Nevertheless, informed consent was obtained from each patient.

Patients
Patients were drawn from the remission induction cohort of the Dutch Rheumatoid Arthritis Monitoring (DREAM) registry [10]. This observational multicenter cohort started in 2006 and, although patient recruitment for this cohort has stopped in 2012, data collection of included patients is still ongoing. For this study, all data available at 0, 3, 6, 9, and 12 months were accessible for analyses. To be eligible for inclusion in the cohort, patients were DMARD and prednisolone naïve, they were allowed to have a maximum symptom duration of 1 year, and they needed to be 18 years or older. Additionally, they were not in remission, as measured with the DAS28. Early RA classification was based on a clinical diagnosis by the rheumatologist. For the present study, patients were classified into 2 groups as identified in the study by Siemons et al. [11]: a ''fast response'' group of patients quickly reaching remission and a ''slow response'' group of patients reaching remission at a slower pace. Because reliability calculations are sample dependent slight differences can be expected between the two response groups.

Measures
The primary measures of interest were the 28-tender joint count (TJC28), the 28-swollen joint count (SJC28), a 100 millimeter visual analogue scale on general health (GH: where 0 = very good and 100 = very bad), the erythrocyte sedimentation rate (ESR), and the C-reactive protein (CRP). Using these variables, the DAS28-ESR and DAS28-CRP were calculated as follows [12]: Patients additionally completed the Health Assessment Questionnaire (HAQ) which measures physical functioning [13] and the 36-item Short Form Health Survey (SF-36) which assesses physical and mental health status [14].

Statistical analyses
Baseline between-group comparisons were made using independent t-tests for normally distributed continuous variables, Kruskal Wallis tests for variables with skewed distributions, and Chi-Square tests for dichotomous variables.
Although a DAS28 score is assumed to represent a patient's disease activity, in reality this score consists of two parts: 1) the actual (true) score of that patient's disease activity, and 2) random measurement errors [3]. Errors give rise to an under-or overestimation of the true score. This might be due to, among others, certain distractions during test administration, the patient's mood while filling out the test, or a misreading of the items. Reliability is a representation of measurement consistency; it is the ratio between true score variance and observed total score variance, where the latter consists of a true score part and an error part [3]. Higher error variance leads to lower reliability. However, although these basic principles do apply to the individual components of the DAS28, the composite reliability of an index also depends on the interrelationships of its components. Composite reliability is a function of the reliability of the individual components, the weights that are assigned to the components as reflected in the DAS28 formulas, the variances and error variances of the component scores, and the correlations and covariances between the different components. All this can be combined into the following formula [8,15]: Where: I. w i is the weight of component i, as defined in the DAS28 formulas 1 and 2 described above; II. s Xi,Xj is the covariance between the two components, which can be rewritten as: Correlation Xi,Xj * Standard deviation Note that the nominator of the ratio in formula 3 is the error variance of the index, while the denominator is the total variance. In this study, the error variances of joint count components were calculated from their observed variance and reliability levels, whereas the error variances of GH and both inflammatory measures were obtained from univariate linear regression analyses. However, two significant problems arose during component reliability computations.
First, it was not appropriate to calculate the reliabilities of the joint count components with Cronbach's Alpha, since Cronbach's Alpha assumes the total score to be a linear combination of all items (i.e. each item is regarded as a parallel test of the other component items) whereas the joint count components are square root transformed in the DAS28. Consequently, split-half reliabilities were determined instead. As demonstrated by Siemons et al. [16], RA is characterized by a definite left-right symmetry of joint involvement. Consequently, the square root sum scores of the left and right joints were considered to represent two parallel tests and their correlation was used as an estimate of reliability.
Second, given the structure of the cohort, with time frames of several weeks or even months between consecutive measurements, it was not possible to perform proper test-retest reliabilities for the single-item components (i.e. ESR, CRP, and GH). Therefore, a generalizability theory principle was used to determine reliability. After running a univariate general linear model analysis on a longitudinal dataset (including all available data at 0, 3, 6, 9, and 12 months) the person variance could be separated from the time variance and reliability could be calculated as the ratio between person variance and total variance.
All computations of reliabilities were performed on the total patient group as well as on the two identified subgroups using SPSS version 21.0.
After calculating the composite reliabilities, it was investigated whether the reliability in both the total as well as the specific patient groups could be optimized by adjusting the component's weights. Optimal weights were computed by minimizing the error variance of the index, that is, the nominator of the ratio in formula 3, subject to the constraint that the total variance of the index does not change. The resulting quadratic optimization problem under quadratic constraints was solved using a procedure developed by Albers, Critchley, and Gower [17]. For more applications, refer to Albers, Critchley, and Gower [18].

Patients
A total of 565 patients were included for analysis; 466 from the ''fast response'' group and 80 from the ''slow response'' group. Overall, they had a mean age of approximately 58 years, the majority were women (63%) and rheumatoid factor positive (54%), and their baseline disease was characterized by several tender and swollen joints as well as by a diminished physical health status (Table 1).
No significant differences were found in age, mental health, or rheumatoid factor positivity between the fast and slow response group, but the slow response group did contain significantly more women. Additionally, the slow response group had a significantly worse disease condition, characterized by higher DAS28 scores, a poorer state of general and physical health, higher levels of inflammatory markers, and a higher number of tender and swollen joints, as compared to the fast response group. Tables 2-4 show the variances, error variances and reliabilities of the individual components together with their correlations and covariances. All correlations were small to moderate, as is not unusual in index measures. Component reliabilities varied widely both within and between sub-groups, ranging from 0.614 for GH in the ''slow response'' group to 0.912 for ESR in the ''fast response'' group. Furthermore, ESR component reliabilities were highest and the TJC28 outperformed the SJC28 across all groups.

Reliability calculations
Using formula 3 resulted in overall reliabilities of 0.85 and 0.86 for the DAS28-ESR and DAS28-CRP composites, respectively. Sub-analyses showed DAS28-ESR reliabilities of 0.85 and 0.82 and DAS28-CRP reliabilities of 0.85 and 0.84 for the ''fast response'' and ''slow response'' group, respectively. These results demonstrate that both DAS28 scores are approximately equally reliable across patient groups, that is, the differences were only small and all reliability levels were high (.0.80).

Optimizing reliability
The results of the optimization of composite reliability by adjusting the weights are shown in Table 5. The original weights are given in the third column. The next three columns give the estimated optimal weights in the total group and in the ''fast response'' and ''slow response'' groups, respectively. Further, in  the rows labeled ''Reliability'', the reliabilities using the original weights and the optimal weights are given. In all groups, reliabilities increased after weight optimization. The largest gains were obtained in the total and ''fast response'' groups by decreasing the weight of the TJC and GH. In the smaller slow response group, on the other hand, the weights of the TJC28 and SJC28 were increased, while the weights of the inflammatory markers and GH were substantially decreased.

Discussion
Overall, composite reliability levels of 0.85 and 0.86 were found for the DAS28-ESR and DAS28-CRP, respectively. This is sufficiently high for group use and around common thresholds considered sufficient for individual use [19,20], justifying the use of the DAS28 in both clinical research and clinical practice. Moreover, reliability could be increased even further by optimizing the component weights.    Several findings are worth mentioning when comparing the individual component reliabilities or when optimizing the component weights. At first, it can be observed that ESR had the highest reliability of all DAS28 components. The finding that this measure of inflammation was also more reliable than the CRP measure might be explained by their different responsiveness to changes in inflammatory stimuli. While the ESR is a relatively stable measure over time, which responds slowly to changes and reflects the disease activity of the past few weeks, the CRP fluctuates more heavily due to a more rapid response to short-term changes in the inflammatory stimuli [12,21,22]. The ESR is also given a higher weighting than the CRP in the DAS28 formulas and that remained to be the case after weight optimization.
When looking at the joint count reliabilities, the swollen joint count had a lower reliability than the tender joint count, consistent with findings from previous studies [9,23,24]. Joint counts are sometimes referred to as a semi-objective clinical measures [1] and, as discussed by Pincus [25], they have been shown to be poorly reproducible. Large intra-and interobserver variability is commonly found especially in the swollen joint count [24]. This might be explained by a higher dependency of the swollen joint assessment on factors like the assessors' levels of training and experience, a lack of standardization in examination methods, unclear definitions of swelling, or the degree of joint deformity [24,25,26]. After weight optimization, however, the weight of the tender joint count was substantially lowered, even below the weight of the swollen joint count. This corresponds with the common clinical perspective of disease activity in RA. Joint swelling is usually considered to be a more representative measure of inflammation than joint tenderness [27] and has been shown to play a major role in the physician's assessment of disease activity [28].
Finally, it can be observed that the patient reported degree of GH in the ''slow response'' group had the lowest reliability of all components, even below the recommended reliability threshold for group use (r.0.70). Its weight was also substantially decreased after weight optimization. This could be a confirmation of the weakness of this component, as the inclusion of GH in the DAS28 has been often criticized. For instance, previous studies have shown elevated GH scores while none of the other DAS28 components showed any sign of an active disease, possibly due to effects beyond the clinical inflammatory processes of RA [29]. Also, GH ratings have been shown to be different across patients with similar DAS28 scores, dependent on the moment of administration, possibly caused by a response shift [30]. The GH component is also the most subjective component of the DAS28 and, therefore, more susceptible to measurement error. Although it could be argued to solely include the more objective clinical measures, the inclusion of a patient-driven component is desirable given the increased awareness of the importance of the patient perspective in assessing disease activity since the 1980s [31] as reflected by their inclusion in the provisional ACR/EULAR definition of remission in RA [32] as well as in the preliminary core set of disease activity measures [33]. Disease activity in RA is a multifactorial concept and appears to be best measured by both objective clinical measures and patient-reported outcomes as they each address a different aspect of disease activity [27]. Therefore, it would be interesting to explore other, more reliable, patientreported outcome measures for inclusion in the DAS28. What measure would be best warrants further research. Measures of pain or fatigue appear to be promising alternatives, given the recognition of pain as one of the most important determinants of a patient's global assessment of disease activity [34,35,36] and the recommendation to measure fatigue in addition to the other core set measures of RA [37]. But, of course, other patient-reported measures can also be explored.
A possible limitation of the current study is the difference in sample size between the fast response and slow response group. Though this might cause the results of the ''fast response'' group to be more robust than the results of the ''slow response'' group, the importance of the tender joint count in the slow response group is consistent with the general belief that approximately 10-20% of patients with RA have secondary fibromyalgia (FM) [38,39]. FM patients tend to have a lower pain threshold [38,39] which might explain the higher relevance of the TJC28 rating in this patient group, increasing our confidence in the robustness of the results.

Conclusions
The DAS28-ESR and the DAS28-CRP are both reliable instruments for assessing disease activity in early RA although reliability can be increased even further by adjusting the individual component weights. Overall, the findings suggest that the largest gains in reliability can be achieved by substantially lowering the weights of the tender joint count and patient-reported general health. Future studies should explore the possibilities of including a better indicator of the patient perspective in the disease activity score.