Assessment of content validity and psychometric properties of VISA-A for Achilles tendinopathy

A recent COSMIN review found that the Victorian Institute of Sports Assessment–Achilles tendinopathy questionnaire (VISA-A) has flawed construct validity. The objective of the current study was to assess specifically the process of how VISA-A was constructed and validated, and whether the Danish version of VISA-A is a valid patient-reported outcome measure (PROM) for measuring the perceived impact of Achilles tendinopathy. The original item generation strategy for content validity and the process for confirming the scaling properties (construct validity) were examined. In addition, construct validity was evaluated directly using several psychometric methods (Rasch analysis, confirmatory factor analysis (CFA), and multivariable linear regression) in a cohort of 318 persons with Achilles tendinopathy with symptom duration groups ranging from less than 3 months to more than 1 year of chronicity, and a group of 120 healthy persons. We found that the item generation and item reduction in the original construction of VISA-A were based on literature review and clinician consensus with little or no patient involvement. We determined that 1) VISA-A consists of ambiguous conceptual item themes and thus lacks content validity, 2) there was no thorough investigation of the psychometric properties of the original version of VISA-A, which thus lacks construct validity, and 3) rigorous direct assessment of the psychometric properties of the Danish VISA-A revealed inadequate psychometric properties. In agreement with the COSMIN study, we conclude that when used as a single score, VISA-A is not an adequate scale for measuring self-reported impact of Achilles tendinopathy.


Introduction
The Victorian Institute of Sports Assessment-Achilles tendinopathy questionnaire (VISA-A) [1] is the most widely used patient-reported outcome measure (PROM) for studies of Achilles tendinopathy [2]. However, a recent editorial questions the usefulness of VISA-A, and underscores the limited evidence of its "clinimetric properties" [3]. Furthermore, a recent COSMIN checklist review concluded that all 11 versions of VISA-A have flawed construct validity and weak responsiveness [4]. The purpose of PROMs is to measure a person's perception of the impact of some pathology, not the physical pathology itself [5]. The measurement process consists of assigning numerical values to the responses to the questions (items). These values are summed in an overall (total) score that should represent the magnitude of the problem for the patient [6]. Each person receives a score on the "test"; the higher the score, the greater the perceived level of pathology (or level of functional ability in the case of VISA-A). The validity of the PROM (i.e., whether it actually measures what it purports to measure) depends directly on the relevance and comprehensiveness of the items for the patient group being assessed (content validity) [7] and on whether the response scores to those questions, when added together, satisfy the basic constraints of measurement (construct validity) [8][9][10]. When PROM data is congruent with (i.e., 'fits') certain statistical measurement models, such as item response theory (IRT) or confirmatory factor analysis (CFA), it follows that the PROM possesses adequate psychometric properties [11][12][13].
As mentioned, significant flaws in the validity of VISA-A were found using the COSMIN checklist [4]. Considering the harsh judgement passed by the COSMIN review, we believe it could be relevant to evaluate specifically which methods were used to generate and choose the items in VISA-A, and which methods were originally used to test its psychometric properties. We also believe that it could be of value to directly evaluate the psychometric properties of VISA-A in our own setting (Denmark) using the most stringent methodology. This could give insights into why the COSMIN assessment resulted in such a negative finding.
This study had three major objectives:
1. Evaluate how VISA-A was created and validated by looking at how the items in the original version of VISA-A were generated and chosen for inclusion in the PROM, and address content validity.
2. Evaluate which statistical methods were used for the psychometric validation of the original VISA-A.
3. Conduct a rigorous analysis of the psychometric properties of the local (Danish) version of VISA-A in a cohort of patients with Achilles tendinopathy and healthy controls.

Methods
First, in order to address the methodological quality of the content validation of the original version of VISA-A, we assessed the methods used to generate and select the items that constitute the body of VISA-A. This included an evaluation of whether the items were generated from the perspective of clinicians or from interviews with patients with respect to the relevance and coverage of the items (content validity) [7]. We also looked at whether the content of the items in the proposed scale was thematically homogeneous and whether the response options were logical and easily understood. Next, we assessed the process used by the creators of VISA-A to confirm the psychometric properties. Thus, we looked at the original methods that were used to evaluate the factor structure and dimensionality of VISA-A, whether there was evidence of fit to an appropriate measurement model, and whether the authors assessed differential item functioning (DIF). DIF is the presence of bias due to different response patterns in specific items between subgroups, such as sex, age group, or injury chronicity [14][15][16][17]. DIF is detrimental to scale properties since it can mask real differences between subgroups, or produce apparent differences that are not due to real changes [16]. If DIF is present, one cannot discern whether detected differences in VISA-A scores are caused by DIF or by real differences in the criterion that VISA-A is assumed to measure. DIF is investigated in models that assess whether the item responses are independent of a list of background variables, conditional on the full VISA-A score. Dimensionality is investigated by assessment of data fit to measurement models such as CFA or IRT.
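To make the idea of DIF concrete, the following is a minimal Python sketch of a stratified check. It groups respondents by total score and, within each stratum, compares the mean score on one item between two subgroups. The function name `stratified_dif`, the field names, and the toy records are hypothetical and chosen only for illustration. This is a simplified stand-in for the concept, not the procedure used in the analyses reported here.

```python
from collections import defaultdict

def stratified_dif(records, item, group_key):
    """Weighted mean between-group difference in an item score,
    computed within strata of equal total score.

    records: list of dicts with keys 'total', item, and group_key (0 or 1).
    Returns None if no stratum contains both subgroups.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for r in records:
        strata[r["total"]][r[group_key]].append(r[item])

    weighted_sum = 0.0
    weight = 0
    for groups in strata.values():
        if groups[0] and groups[1]:  # both subgroups present in this stratum
            n = len(groups[0]) + len(groups[1])
            diff = (sum(groups[1]) / len(groups[1])
                    - sum(groups[0]) / len(groups[0]))
            weighted_sum += n * diff
            weight += n
    return weighted_sum / weight if weight else None

# Hypothetical toy data (not study data): persons with the same total
# score differ systematically on item 3 depending on subgroup.
toy = [
    {"total": 50, "item3": 6, "sex": 0},
    {"total": 50, "item3": 8, "sex": 1},
    {"total": 70, "item3": 7, "sex": 0},
    {"total": 70, "item3": 9, "sex": 1},
]
dif_estimate = stratified_dif(toy, "item3", "sex")
```

A value near zero would be consistent with no uniform DIF for that item and subgroup. A value clearly different from zero means that persons with the same total score answer the item differently depending on subgroup.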
Lastly, we conducted our own analyses of the psychometric properties of VISA-A. These included Rasch IRT, multivariable linear regression, and CFA. The sample was a cohort of 318 persons with Achilles tendinopathy (symptom duration ranging from less than 3 months to more than a year), and a group of 120 healthy persons. The subjects with symptoms < 3 months were sports active participants of both sexes, age 18 or older, with mid-portion Achilles tendon pain recruited from various local sports clubs. The subjects with symptoms > 3 months were 18 to 65 years of age, of both sexes, with mid-portion Achilles tendon pain seen at a sports medicine clinic and a rheumatology outpatient clinic [18]. The healthy persons were male participants, 19 to 90 years, in the 2017 European Masters Athletics Championships [19].

Analysis strategy
We employed several techniques to assess the psychometric properties of VISA-A. First, fit to a Rasch unidimensional measurement model was assessed using Andersen's conditional likelihood ratio test (CLR) [11]. Overall fit was investigated through obtaining item-trait interaction chi-square values (a non-significant chi-square indicates good fit) [20,21]. Individual item fit was assessed by standardized individual item-person fit residuals (i.e., the difference between observed and expected scores) to approximate a Z-score, where values between ±2.5 indicate adequate fit to the model [20,21]. DIF was assessed using analysis of variance [22] for Sex, Age group (+/-44 yrs), BMI (+/-25), and duration of symptoms (≤ 3 months, 4-12 months, > 12 months, and no symptoms at all) [16,23]. For DIF analyses, the cutoff of +/-44 years of age was chosen because the median age for the sample group was 43.6 years. This allowed for a dichotomization of younger versus older persons for comparison of scoring patterns across the groups. For BMI, the value of 25 was chosen, as this was also the median value for the group. The duration of symptoms groups were chosen to allow for a comparison of scoring patterns across groups that could be expected to have different levels of severity of symptoms (i.e., less than 3 months would be acute symptoms, 4 to 12 months would approach chronic symptoms, and more than 12 months would be manifest chronic tendinopathy).
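As an illustration of the grouping used for the DIF analyses and of the ±2.5 criterion for item fit residuals, a minimal Python sketch is given below. The function names and group labels are hypothetical, and the handling of values that fall exactly on a cutoff (e.g., age 44 or BMI 25) or between categories is an assumption, since the text does not specify it.

```python
def age_group(age_years, cutoff=44):
    # Assumption: persons exactly at the cutoff are placed in the older group.
    return "older" if age_years >= cutoff else "younger"

def bmi_group(bmi, cutoff=25):
    # Assumption: a BMI exactly at the cutoff is placed in the higher group.
    return "higher" if bmi >= cutoff else "lower"

def duration_group(months):
    """Symptom duration category; months=None denotes a person without symptoms."""
    if months is None:
        return "no symptoms"
    if months <= 3:
        return "3 months or less"
    if months <= 12:
        # Assumption: durations between 3 and 4 months also land here.
        return "4-12 months"
    return "more than 12 months"

def residual_within_limits(z, limit=2.5):
    """True if a standardized fit residual lies within +/- limit."""
    return abs(z) <= limit
```

For example, `duration_group(2)` returns `"3 months or less"` and `residual_within_limits(-3.1)` returns `False`.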
Due to skewed item response data, which hindered parameter estimation in the Rasch model, we carried out a transformation of the response structure. See the details of this in the results section below.
Next, as the Rasch analysis was performed on transformed data, we used CFA to assess factor structure using the original response data (non-transformed). In these analyses, three separate factor structures of the VISA-A were assessed: the original unidimensional structure, a 2-factor structure (items 1-5 and 6-8), and a 3-factor structure as indicated by the authors in the original paper (items 1-3, 4-6, and 7-8). CFA model fit was considered adequate when the goodness of fit index (GFI) was > 0.95, the root mean square error of approximation (RMSEA) was < 0.06, the standardized root mean square residual (SRMR) was < 0.06, and the Comparative Fit Index (CFI) was > 0.95 [24,25].
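The fit criteria listed above can be expressed as a simple check, sketched here in Python. The cutoffs are those stated in the text. The function name `check_fit` and the example index values are hypothetical and are not results from this study.

```python
# Cutoffs as stated in the text: direction of the inequality and threshold.
CUTOFFS = {
    "GFI": (">", 0.95),
    "RMSEA": ("<", 0.06),
    "SRMR": ("<", 0.06),
    "CFI": (">", 0.95),
}

def check_fit(indices):
    """Return a dict mapping each fit index to True if it meets its cutoff."""
    results = {}
    for name, (direction, cutoff) in CUTOFFS.items():
        value = indices[name]
        results[name] = value > cutoff if direction == ">" else value < cutoff
    return results

# Hypothetical fit indices for illustration only.
example = {"GFI": 0.97, "RMSEA": 0.08, "SRMR": 0.04, "CFI": 0.96}
verdict = check_fit(example)
acceptable = all(verdict.values())  # all four criteria must hold
```

In this hypothetical case, the RMSEA criterion is not met, so the model would not be considered to fit adequately even though the other three indices pass.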
Lastly, we assessed DIF for the same person characteristics as for the Rasch analyses (Sex, Age group, BMI, and Symptom Duration) in multivariable regression analyses of the individual items, also using the original non-transformed data.
RUMM 2030 was used for the Rasch analysis [23]. CFA and regression analyses were carried out with SAS v9.4. Data for the analyses were accessed from trials conducted at our facility: ClinicalTrials.gov Identifier: NCT03401177 and ClinicalTrials.gov Identifier: NCT02580630. All studies were approved by the local institutional review and ethics committees.

Content validity-How was VISA-A developed?
Assessment of the original paper describing the construction of VISA-A [1] revealed that item generation and item reduction were based on literature review and clinician consensus with little or no patient involvement. Further scrutiny showed that each item possesses a mix of themes addressing stiffness, pain, and perceived level of ability, both within individual items and across the 8-item scale. The reader is referred elsewhere for a formal presentation of the items and response structure in VISA-A [1]. However, item 3 is an example of a complicated item concerning pain within the next 2 hours after walking on flat ground for 30 minutes. Is the person scoring pain, the ability to walk on flat ground for 30 minutes, or the ability to walk at all due to pain for 2 hours after having walked for 30 minutes? Items 6, 7, and 8 are also complex with complicated scoring options and thematic ambiguity. Most notable is item 8, which has a mutually exclusive "either...or" scoring structure that crosses categories of pain/no pain with level of training ability. Intuitively, patients may be confused and uncertain as to which theme they are responding. Such items are known as 'double barreled items' or 'ambiguous' [26].
Moreover, VISA-A is scored as a single index (total score), which assumes that all items address unique aspects of the same overall construct (a single dimension). However, closer inspection reveals that the item content addresses symptoms, activity level, and activity duration (which are separate constructs). In addition, the items ask about less demanding functional activities (items 1-5) and more demanding sports-related activities (items 6-8), which potentially are different situational contexts. Indeed, in the original paper, Robinson et al. [1] mention that VISA-A covers three separate domains of pain, function, and activity, which indicates an underlying multidimensional structure that would not support calculating a single total score.
In terms of response structure, items 1-6 use a 0-10 numeric rating scale with 11 response options, instead of adjectival response scales, as is more typical for PROMs [7]. Items 7 and 8 do have a 4-option categorical structure, which is transformed to a 0-10 rating scale (probably in order to fit in with the other items). However, the result is that items 1-7 can achieve up to 70 points, while item 8 has a max score of 30 points. Thus, the VISA-A can tally a maximum score of 100, although there is no obvious reason for assigning item 8 three times the weight of the others. Nor is there an explanation or description of how and why the clinician-based focus groups chose to include the selected eight items (and thus exclude other potential items). This process does not satisfy the general principles of establishing content validity, which requires face-to-face cognitive interviews with the targeted patients to confirm the relevance, coverage, and understandability of the items and response options [7,26].
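The scoring arithmetic described above can be laid out in a few lines of Python. The per-item maxima are taken from the text (10 points each for items 1-7 and 30 points for item 8). The variable names are hypothetical.

```python
# Maximum points per item as described in the text.
MAX_POINTS = {1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 10, 7: 10, 8: 30}

total_max = sum(MAX_POINTS.values())             # 7 * 10 + 30
item8_relative_weight = MAX_POINTS[8] // MAX_POINTS[1]
item8_share = MAX_POINTS[8] / total_max          # share of the total score
```

Here `total_max` is 100 and `item8_relative_weight` is 3, so item 8 alone accounts for 30% of the total score while each of the other seven items accounts for 10%.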
An additional issue with item 7 is that this is the only item where it is explicitly assumed that the person has symptoms. Hence, in a strict sense, item 7 is not relevant for people without Achilles tendon symptoms, which is problematic if healthy persons were used in the validation process (as was the case for VISA-A). An alternative wording such as "6 months ago" instead of "when the symptoms started" might remedy this. In addition, a max score on item 7 can only be achieved for competitive athletes, whereas recreational athletes with high-volume training, who do not participate in competitions, receive a lower score regardless of their actual level of functional ability. These are issues that relate directly to item relevance and comprehensiveness in VISA-A.

Construct validation-How was VISA-A validated?
To test construct validity, the creators of VISA-A calculated Spearman correlation coefficients between VISA-A and two legacy PROMs (i.e., the Percy-Conochie and Curwin-Stanish scales) in a group of non-surgical patients (n = 45), a pre-surgical group (n = 14), and 83 healthy persons. The psychometric properties (i.e., whether VISA-A behaves as a proper measure) were not investigated. This is highly unfortunate, as analysis of these properties forms the core of construct validation, notably the most problematic violations of these properties: multidimensionality and differential item functioning (DIF). Ignoring multidimensionality can at best induce variance and make for a weak instrument or, in the worst case, when the dimensions engage in a trade-off, make for a meaningless instrument. Table 1 shows the characteristics of the people in the sample, the variables used for the DIF analyses, and the VISA-A total scores across subgroups.

Psychometric analyses of the Danish VISA-A
Rasch analysis. Fit to a partial-credit Rasch model was attempted, but initial model estimation failed due to ceiling effect in the item response scales for items 1 through 5. An example of this is seen in Fig 1, which exhibits the frequency distribution of response scores on item 3 for patients with symptoms lasting 3 months or less. The failed parameter estimation was also likely due to excessive response categories, with 10 category probability thresholds to be estimated for each item, with the exception of item 7 with 4 categories and thus 3 thresholds, which yields 73 threshold estimates for all items combined.
To remedy this, the 0-10 response scales were recoded into four categories, matching the response structure of item 7. The recoding was: (0-2 = 0), (3-5 = 1), (6-8 = 2), and (9-10 = 3). A 4-category response scale was then established for all items, which resulted in 24 thresholds and successful model estimation. Table 2 shows that overall fit to the Rasch model was rejected for the combined item set (significant chi-square). Individually, items 2, 3, 5, and 7 exhibited misfit and DIF was observed for Sex in item 3 and for Duration Group in items 3, 5, 6, 7, and 8 (Table 2). DIF was not observed for BMI or age group. Splitting items for DIF [16] for Sex and Duration Group did not remedy model fit (these results not shown).
Confirmatory factor analysis (CFA). As the Rasch analysis was carried out on transformed data, thus ignoring the specific scoring and weighting of the items inherent to VISA-A, we used the original scoring of the items for the CFA analysis and multivariable regression. Consistent with the Rasch results, CFA rejected a unidimensional scale and confirmed a multidimensional structure with items 1-5 in one dimension and items 6-8 in the other (i.e., a 2-factor solution). CFA indicated even more strongly a 3-factor structure with items 1-3, 4-6, and 7-8 in separate dimensions. The results are seen in Table 3.
Multivariable regression. Multivariable regression analyses confirmed DIF for sex in item 3 and DIF across duration of symptoms for most items (also seen in the Rasch analysis). Table 4 shows the full DIF results. This DIF persisted in a sensitivity analysis where the 120 participants in the 2017 European Masters Athletics Championships were omitted (results not shown).
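The recoding applied before the Rasch analysis, and the threshold counts before and after it, can be sketched in Python as follows. The recoding rule and the counts (73 and 24) are those given in the text. The function names are hypothetical, and the sketch only illustrates the bookkeeping, not the model estimation carried out in RUMM 2030.

```python
def recode(score):
    """Collapse a 0-10 response into the four categories described in the text."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score <= 2:
        return 0
    if score <= 5:
        return 1
    if score <= 8:
        return 2
    return 3

def n_thresholds(categories_per_item):
    """An item with k response categories has k - 1 thresholds."""
    return sum(k - 1 for k in categories_per_item)

# Original structure as described: seven items with 11 categories each,
# and item 7 with 4 categories.
original_structure = [11] * 7 + [4]
# After recoding, all eight items have 4 categories.
recoded_structure = [4] * 8

thresholds_before = n_thresholds(original_structure)
thresholds_after = n_thresholds(recoded_structure)
recoded_scale = [recode(s) for s in range(11)]
```

This reproduces the counts stated in the text: 7 × 10 + 3 = 73 thresholds before recoding and 8 × 3 = 24 afterwards.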

Discussion
These results support the findings of the COSMIN review conducted by Ortega-Avila and colleagues [4], in that we could confirm significant flaws in the validity of VISA-A. As a PROM, VISA-A clearly lacks content validity, as patients were not included in the process of item generation or item reduction, the adequacy of the measurement properties of VISA-A was never confirmed using appropriate validation methods, and our own rigorous analyses using data from patients with Achilles tendinopathy revealed substantial problems. Without patient feedback to generate item content, how do we know if relevant and understandable questions are being asked? Does the scoring structure of each item make sense for patients? For example, what does it actually mean for a patient to score a 3 or an 8 on item 3? Trying to determine where to score on an arbitrary pain scale to describe the level of pain that is expected within the next 2 hours after walking 30 minutes is a complex question to answer. The fact that there is an extensive ceiling effect for most items (items 1-5) across all duration groups indicates that those items fail to target patients with Achilles tendinopathy adequately. Such a ceiling effect is only ever justifiable for persons without symptoms.
We found that the original validation was superficial. It was based on Spearman correlation of scores from only 59 patients regressed against other legacy PROMs, which themselves may not reflect good measurement of Achilles tendinopathy. This cannot be considered an assessment of the instrument's psychometric properties, but simply a measure of criterion validity, which does not ensure that the criterion instrument used for reference is trustworthy or valid. No tests of dimensionality, fit to a measurement model, or tests of person-item bias (DIF) were conducted.
Our own analyses of these components on a broad sample of patients and healthy persons revealed several problems with the intrinsic measurement properties of VISA-A. First, the assumption of unidimensionality was rejected. Hence, the computation of VISA-A as a total score is problematic. A possible solution here is to divide VISA-A into the two or three subscales that were confirmed using CFA. This is a strategy that can be implemented retrospectively (i.e., pre-existing historic data can still be used to calculate a multidimensional VISA-A score).
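As a sketch of how subscale scores could be computed retrospectively from existing item-level data, the following Python example sums item scores according to the factor structures examined here (items 1-5 and 6-8, or items 1-3, 4-6, and 7-8). The dictionary layout, the function name `subscale_scores`, and the example responses are hypothetical and are not study data.

```python
# Item groupings for the factor structures examined in the text.
STRUCTURES = {
    "1-factor": {"total": range(1, 9)},
    "2-factor": {"items 1-5": range(1, 6), "items 6-8": range(6, 9)},
    "3-factor": {
        "items 1-3": range(1, 4),
        "items 4-6": range(4, 7),
        "items 7-8": range(7, 9),
    },
}

def subscale_scores(item_scores, structure):
    """Sum item scores within each subscale of the chosen structure.

    item_scores: dict mapping item number (1-8) to the scored response.
    """
    return {
        name: sum(item_scores[i] for i in items)
        for name, items in STRUCTURES[structure].items()
    }

# Hypothetical responses for one person (not study data).
responses = {1: 8, 2: 7, 3: 9, 4: 6, 5: 7, 6: 5, 7: 4, 8: 10}
two_factor = subscale_scores(responses, "2-factor")
three_factor = subscale_scores(responses, "3-factor")
```

Because the subscales partition the eight items, the subscale scores in either structure add up to the conventional total score, so no information from historic data is lost by reporting them separately.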

PLOS ONE
Validity of VISA-A
Table 4. Multivariable regression analysis of differential item functioning (DIF) on the covariates sex, duration of symptoms, body mass index (BMI), and age group.
Probably the greatest problem was that there was DIF for the covariate 'duration of symptoms' across all but one item, which suggests that VISA-A measures a different construct for patients in the different symptom duration groups. When the construct being measured changes over time, the meaning of intervention effects that reach over longer periods is undermined, particularly if the groups compared are defined by the duration of symptoms. Hence, to neutralize this DIF, we suggest that if VISA-A is used as outcome, conducting trials with follow-up longer than three months should be avoided, and only comparison of patients that all have the same (short) duration of symptoms at baseline should be undertaken. This is important because comparisons of constructs that change over time or between groups will undermine the interpretation of intervention effects.
In contrast with our results, other studies have found VISA-A to be valid and reliable [27]. However, a closer review of those reports reveals that the validation methods closely mirror those used in the original paper [28,29], which means they fail to satisfy the basic constraints of content validity and the psychometric measurement properties. There are two notable exceptions. One group [30] found a 2-factor structure for items 1-6 and 7-8 using exploratory factor analysis, but with only 51 patients and no confirmatory tests, the results cannot be considered robust (although they somewhat agree with our findings). A more recent study concluded that a 1-factor solution was viable using CFA [29]. However, the analysis was based on data from just 70 patients, and the study unfortunately did not assess measurement invariance (DIF).
We failed to generate a Rasch model for the proposed 8-item construct with the 11-category visual analog response scales. We therefore restructured the response scales, which allowed for successful parameter estimation. However, this still revealed substantial misfit and DIF. In order to accommodate the original scoring structure of the VISA-A, we conducted analyses of dimensionality and DIF in the original format using CFA and multivariable regression. Here, we found no major differences between the results of the Rasch analysis, the CFA, and the linear multivariable analyses, and therefore feel justified in our choice of methods.
Our results support the findings of Ortega-Avila and colleagues [4], in which they used the COSMIN checklist and found significant flaws in the construct validity of VISA-A. We chose not to apply COSMIN for our analyses, for two reasons. First, it would have been redundant, as Ortega-Avila et al. included the Danish version in their study. Second, while COSMIN is an exhaustive tool for assessing which methods have been used to create and validate PROMs, it does not specifically address the superiority of one validation method relative to others. For example, COSMIN does not consider whether CFA or IRT is more (or less) robust than exploratory factor analysis (EFA) or correlation with legacy instruments (criterion validation). Therefore, we applied the most robust assessment methods to assess the psychometric properties, as we found no studies that previously had applied Rasch IRT, CFA, or the multivariable analyses we chose to use. Moreover, while Ortega-Avila et al. specifically targeted the 11 studies in the different language versions of VISA-A that assessed the construct validity and measurement characteristics, we focused more on the process behind the genesis and the validation of the original PROM and sought to verify these results with our own analyses.

Conclusion
VISA-A is not a robust scale for measuring Achilles tendinopathy. It lacks content validity and construct validity, and thorough validation methods were not used to test its measurement properties during the development phase or subsequently. Furthermore, rigorous psychometric assessment of the Danish version revealed that VISA-A does not satisfy a measurement model, lacks unidimensionality, and exhibits DIF depending on the duration of symptoms. A new relevant PROM for Achilles tendinopathy should be developed and appropriately tested for validity. Meanwhile, simple pain scoring (e.g., numeric rating scales) and functional tests are suggested as more appropriate outcome measures for studies of Achilles tendinopathy. VISA-A sub-scores can still be calculated as described in the original paper, which means that existing research using VISA-A data need not be discarded. However, this option does not address the poor psychometric properties of VISA-A.