How to Capitalize on the Retest Effect in Future Trials on Huntington’s Disease

The retest effect—improvement of performance on second exposure to a task—may impede the detection of cognitive decline in clinical trials for neurodegenerative diseases. We assessed the impact of the retest effect in Huntington’s disease trials, and investigated its possible neutralization. We enrolled 54 patients in the Multicentric Intracerebral Grafting in Huntington’s Disease (MIG-HD) trial and 39 in the placebo arm of the Riluzole trial in Huntington’s Disease (RIL-HD). All were assessed with the Unified Huntington’s Disease Rating Scale (UHDRS) plus additional cognitive tasks at baseline (A1), shortly after baseline (A2) and one year later (A3). We used paired t-tests to analyze the retest effect between A1 and A2. For each task of the MIG-HD study, we used a stepwise algorithm to design models predictive of patient performance at A3, which we applied to the RIL-HD trial for external validation. We observed a retest effect in most cognitive tasks. A decline in performance at one year was detected in 3 of the 15 cognitive tasks with A1 as the baseline, and 9 of the 15 cognitive tasks with A2 as the baseline. We also included the retest effect in performance modeling and showed that it facilitated performance prediction one year later for 14 of the 15 cognitive tasks. The retest effect may mask cognitive decline in patients with neurodegenerative diseases. The dual baseline can improve clinical trial design, and better prediction should homogenize patient groups, resulting in smaller numbers of participants being required.


Introduction
Huntington's disease (HD) is an inherited neurodegenerative disorder involving motor, behavioral and cognitive impairments [1]. The cognitive disorders have a major impact on daily life, but most clinical trials focus on motor endpoints. This is because clinical trial endpoints must be able to capture both patient decline and treatment efficacy, and cognitive decline is much more difficult to capture within one year in patients at early disease stages [2] than motor decline. This difficulty of assessment results from the heterogeneity of cognitive changes (language, memory, etc.) and two opposing effects: the retest effect and patient decline due to disease progression. The retest effect is defined as an improvement in performance with repeated exposure to a task, with the greatest improvement occurring between the first two assessments [3][4][5]. This effect combines familiarity with the task and testing environment and the possible recall of responses [2]. The first assessment, during which everything is new to the patient, is always the most difficult.
The retest effect may have contributed to the failure of some neuroprotection trials, by adding noise to statistics comparing patients with different backgrounds at baseline, particularly in trials including small numbers of patients, such as those assessing biotherapy. One approach to neutralizing the retest effect is to carry out a second assessment (A2) shortly after the first (A1), and then discard the results obtained at A1 from the analysis, using performance at A2 as the baseline [2]. In addition, the retest effect (ΔA2-A1) can be used to improve the prediction of long-term patient performance. Indeed, in an observational longitudinal study in HD patients, the retest effect (ΔA2-A1 around 7 months) accounted for up to 36% of the variance of performance at A3 (ΔA3-A2 around 29 months) [6]. Likewise, in healthy elderly adults, performance at A3 (one year) is accurately predicted by the one-week-interval retest effect (ΔA2-A1) [7].
However, the impact of the retest effect in clinical trials, which include additional variability (placebo effect, hope, anxiety about treatment and randomization), remains unknown. Two trials, the Multicentric Intracerebral Grafting in Huntington's Disease (MIG-HD) [8] and Riluzole in Huntington's Disease (RIL-HD) [9] trials, were designed with a short-term test-retest procedure. We used the MIG-HD trial (i) to assess whether the retest effect modified performance and whether our strategy of using the second assessment as a baseline was sensitive to cognitive decline in the long term (A3) and (ii) to evaluate whether introducing the retest effect (ΔA2-A1) into the model of disease progression in patients improved the predictive value of the model in the long term (A3). Finally, we transferred the models obtained for the MIG-HD cohort to the RIL-HD cohort, to assess their predictive value in another population.

Participants and design
Patients were enrolled in two separate trials: the MIG-HD trial (N = 54, Ref NCT00190450, PI AC Bachoud-Lévi) [8], which is currently underway, and the placebo group of the cognitive ancillary study of the RIL-HD trial conducted only in France (N = 39, Ref NCT00277602, study coordinator Sanofi) [9]. Both trials were approved by the institutional review board (Comité Consultatif de Protection des Personnes dans la Recherche Biomédicale) of Henri-Mondor Hospital at Créteil (MIG-HD on September 25, 2001, and RIL-HD on December 18, 2002). All patients gave signed informed consent. The data were analyzed anonymously.
The MIG-HD trial is a phase II randomized trial assessing the efficacy of cell transplantation in HD patients at early stages of the disease. Patients were assessed at inclusion (A1), then 35 days (SD = 15) later (A2). They were randomized at one year (A3), to determine the timing of [...]. Patients were followed up until 52 months. The RIL-HD trial is a phase III multinational, randomized, placebo-controlled, double-blind study, for which a cognitive ancillary study was conducted in France from 1999 to 2004, on patients with moderately advanced HD. Patients were assessed at inclusion (A1), 15 days (SD = 8) later (A2) and at one year (A3), with randomization at A2.
The demographic features for patients at A1 are displayed in Table 1.

Clinical assessments
The Unified Huntington's Disease Rating Scale (UHDRS) [10] and additional cognitive tests were used in both studies. Motor score reflected both voluntary and involuntary capacity and ranged from 0 to 124 (highest severity). Functional disability was assessed with Total Functional Capacity (TFC, range: 13 to 0) and Independence Scale (IS, range 100 to 0) scores, with lower scores indicating greater functional impairment, and the Functional Assessment Scale (FAS, 25 to 50), with higher scores indicating greater functional impairment. The severity and frequency of behavioral dysfunctions were quantified with the behavioral part of the UHDRS (range: 0 to 88), with higher scores indicating greater impairment. Global cognitive efficiency was evaluated with the Mattis Dementia Rating Scale (MDRS) [11]. Several tasks were used to assess attention and executive functions: letter fluency (for P, R and V in French) determined for 1 minute, the Symbol Digit Modalities Test (SDMT), the three components of the Stroop test (color naming, word reading, and color-word interference), each assessed for 45 seconds [12], categorical fluency (for animals) assessed for 1 minute [13], [14], the Trail-Making Test forms A and B (TMT A and B) [15], scoring the time taken to link 25 points, with a maximal time of 240 seconds, and figure cancellation tasks [16], in which patients were asked to cross out one, two and then three figures from a panel of signs, in 90 seconds, with lower scores indicating greater cognitive impairment. Short-term and long-term memory were evaluated with the Hopkins Verbal Learning Task (HVLT) including immediate recall, delayed recall and recognition tasks [17], [18]. By contrast to the other tasks, the HVLT was assessed with alternating parallel forms. Each patient performed one motor test, three functional tests, one behavioral test and 15 cognitive tests at each assessment point.

Statistical Analysis
Evaluation of the retest effect in the MIG-HD cohort. For each task, we used Student's t-tests for paired data to compare performances: first between A1 and A2, to measure the potential retest effect; then between A1 and A3, to assess the decline over a one-year period; and finally between A2 and A3, to determine whether discarding the A1 data unmasked a decline that was otherwise undetectable.
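The three comparisons can be illustrated with a short sketch. The authors ran their analyses in R; the Python below is only an illustration under stated assumptions. The function name `paired_t` and all scores are hypothetical, not trial data. The sketch returns the t statistic and degrees of freedom only; the P value would then be read from a Student t distribution with that many degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for a paired t-test of second - first."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Hypothetical scores for five patients (illustrative values, not trial data).
a1 = [30, 25, 41, 36, 28]  # baseline
a2 = [33, 27, 44, 36, 31]  # shortly after baseline
a3 = [29, 24, 40, 33, 27]  # one year later

for label, x, y in [("A1 vs A2 (retest)", a1, a2),
                    ("A1 vs A3 (decline, A1 baseline)", a1, a3),
                    ("A2 vs A3 (decline, A2 baseline)", a2, a3)]:
    t, df = paired_t(x, y)
    print(f"{label}: t = {t:.2f}, df = {df}")
```

In this made-up example the decline is more pronounced when A2 rather than A1 serves as the baseline, because the A1 scores are depressed by the absence of any retest gain.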
Modeling of performance for the MIG-HD cohort. For each task, we selected the multivariate linear model best predicting the data at one year, by stepwise selection [19] with the Akaike Information Criterion (AIC) [20]. This iterative algorithm selects, without prior assumptions, the best predictive factors from a set of 10 variables: performance at A1, retest, age, sex, education level (expressed as the number of years spent studying), parental inheritance, age of parent at disease onset, CAG repeat length, time since disease onset, and the nature of the first symptom appearing at disease onset (motor, cognitive or psychiatric), as determined by the clinician or, if no clinician's assessment was available, by the family or the patient. Lower AIC values indicate a better fit of the model to the data. The first model selection step was carried out for patients with complete data sets only. Estimates of regression coefficients were then refined by recalculating each model, using all the available complete data for the selected variables. The retest is the difference between performance at A2 and performance at A1, and is denoted ΔA2-A1. For each task, performance at A3 (P) was predicted as follows:

$$
\begin{aligned}
P = {} & \beta_0 + \beta_{A_1} \times \text{performance at } A_1 + \beta_{\text{retest}} \times \Delta A_2\text{-}A_1 + \beta_{\text{age}} \times \text{age} + \beta_{\text{sex}} \\
& + \beta_{\text{education}} \times \text{education level} + \beta_{\text{inheritance}} + \beta_{\text{parent}} \times \text{age of parent at onset} \\
& + \beta_{\text{CAG}} \times \text{number of CAG repeats} + \beta_{\text{time}} \times \text{time since onset} + \beta_{\text{first symptom}}
\end{aligned}
$$

where age, education level and age of parent at onset are expressed in years; the first symptom could be motor, cognitive or psychiatric; β0 is the intercept and, for each variable, β variable is its associated regression coefficient (0 for the variables not selected). For quantitative variables, β variable was multiplied by the value of the variable. For qualitative variables (sex, inheritance and first symptom), "woman", "maternal inheritance" and "motor symptom" constituted the reference factors, such that β woman = β maternal = β motor = 0. Calculation of the associated 95% predictive interval (95% PI) is explained in the supplemental data (S1 Text).
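The idea of AIC-driven selection can be sketched as follows. This is a simplified, forward-only illustration, not the authors' procedure: their stepwise algorithm, run in R, may also have removed variables at each step. The helper names (`ols_rss`, `aic`, `forward_select`), the three candidate variables and all data are hypothetical. The AIC is computed up to an additive constant as n·ln(RSS/n) + 2k, where k is the number of regression coefficients; the sketch assumes no perfectly collinear predictors and a non-zero residual sum of squares.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for k in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def aic(rss, n, n_coefs):
    """Gaussian AIC up to an additive constant (lower is better)."""
    return n * math.log(rss / n) + 2 * n_coefs

def forward_select(candidates, y):
    """Add, one at a time, the variable that lowers the AIC most; stop when none does."""
    selected, best = [], aic(ols_rss([], y), len(y), 1)
    while True:
        trials = []
        for name in candidates:
            if name not in selected:
                cols = [candidates[v] for v in selected + [name]]
                trials.append((aic(ols_rss(cols, y), len(y), len(cols) + 1), name))
        if not trials:
            break
        score, name = min(trials)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected, best

# Hypothetical data for eight patients (illustrative values, not trial data).
candidates = {
    "a1": [20, 25, 30, 35, 40, 45, 50, 55],      # performance at A1
    "retest": [2, -1, 3, 0, 1, 4, -2, 2],        # performance at A2 minus A1
    "age": [40, 52, 45, 60, 38, 49, 55, 43],
}
y = [20.9, 21.5, 28.5, 29.1, 34.0, 39.6, 38.8, 44.8]  # performance at A3

selected, score = forward_select(candidates, y)
print("selected variables:", selected, "AIC:", round(score, 2))
```

The penalty term 2k is what prevents the procedure from simply adding every variable: a predictor is retained only if the reduction in residual error outweighs the cost of one more coefficient.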
External validation on the RIL-HD cohort. We used models constructed from data for the MIG-HD cohort to predict performances at A3 for each patient in the RIL-HD cohort. Then, for each task, we measured the concordance between observed (O) and predicted (P) values, using the intraclass correlation coefficient (ICC) and the coefficient of determination (R²e). The ICC was calculated with a two-way mixed effect model [21] and evaluates agreement between observed (O) and predicted (P) performances at A3 in the RIL-HD cohort. The coefficient of determination (R²e) is the percentage of the observed performance variance explained by the model constructed from MIG-HD data. It assesses the degree to which observed performance at A3 in the RIL-HD cohort is accurately predicted by the model, as follows:

$$R^2_e = 1 - \frac{\sum_i (O_i - P_i)^2}{\sum_i (O_i - m)^2}$$

where i refers to a patient and m is the mean observed performance at A3. R²e = 1 indicates a perfect predictive value of the model, whereas R²e ≤ 0 indicates that the model is not informative.
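A minimal sketch of the external coefficient of determination, assuming it takes the usual form of one minus the ratio of the squared prediction errors to the squared deviations of the observed values from their mean m. The function name `r2_external` and the numbers are hypothetical, and the ICC is not reproduced here.

```python
def r2_external(observed, predicted):
    """1 - sum((O_i - P_i)^2) / sum((O_i - m)^2), with m the mean observed value."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equally long")
    m = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - m) ** 2 for o in observed)
    if sst == 0:
        raise ValueError("observed values have zero variance")
    return 1 - sse / sst

# Hypothetical observed performances at A3 (illustrative values, not trial data).
observed = [10, 20, 30]
print(r2_external(observed, [10, 20, 30]))  # perfect prediction
print(r2_external(observed, [20, 20, 20]))  # predicting the mean for everyone
print(r2_external(observed, [30, 20, 10]))  # predictions worse than the mean
```

Because the model is fitted on one cohort and evaluated on another, this quantity is not bounded below by zero: a model that predicts worse than the observed mean yields a negative value.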
Analyses were performed with R 2.13 software (http://www.r-project.org/). All tests were two-tailed and values of P < 0.05 were considered significant.

Evaluating the retest effect in the MIG-HD cohort
We assessed the retest effect between A1 and A2 in the MIG-HD cohort. Performance improved in seven cognitive tasks, and remained stable in the other cognitive, motor and functional tasks, except for FAS score, which declined between A1 and A2 (Fig 1).
We assessed decline between A1 and A3 and between A2 and A3 in the MIG-HD cohort (Fig 2). The use of A2 as the baseline increased the number of tasks for which a decline in performance was detected from three to nine, but FAS score was the only motor or functional performance affected. Indeed, FAS performance declined between A1 and A3 but not between A2 and A3. Behavioral performance improved between A2 and A3.
Table 2 displays the regression coefficients of the predictive model for each task, for the MIG-HD cohort. Performance at A1 was predictive of performance at A3 in all tasks. Introducing the difference in performance between A1 and A2 (ΔA2-A1) into the models improved the prediction of performance at A3 for 14 of the 15 cognitive tasks, for behavioral and motor performance and TFC. Larger numbers of CAG repeats were associated with poorer FAS and IS scores and poorer motor performance, but better behavioral performance. Women outperformed men in 7 of the 15 cognitive tasks. Sex had no effect on motor and functional performances, whereas behavioral performance was better in women than in men. Higher education levels were associated with better performance at A3 for all components of the HVLT.

Modeling of performance in the MIG-HD cohort
The regression coefficients presented in Table 2 are those used in the predictive models. For example, the performance at A3 in letter fluency 1' is given by the following formula:

$$
\text{performance at } A_3 =
\begin{cases}
10.27 + 0.66 \times \text{performance at } A_1 + 0.84 \times \text{retest} & \text{(woman)} \\
10.27 + 0.66 \times \text{performance at } A_1 + 0.84 \times \text{retest} - 2.55 & \text{(man)}
\end{cases}
$$

The equations associated with the predictive models for each task are detailed in S1
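The letter fluency example can be worked through numerically. The sketch below reads the coefficients of the formula as 10.27 (intercept), 0.66 (performance at A1), 0.84 (retest) and −2.55 (man, with woman as the reference); the function name and the patient scores are hypothetical.

```python
def predict_letter_fluency_a3(perf_a1, perf_a2, sex):
    """Predicted letter fluency at A3 from the A1 score, the A2 score and sex."""
    retest = perf_a2 - perf_a1  # retest effect, A2 minus A1
    predicted = 10.27 + 0.66 * perf_a1 + 0.84 * retest
    if sex == "man":
        predicted -= 2.55  # "woman" is the reference category (coefficient 0)
    elif sex != "woman":
        raise ValueError("sex must be 'woman' or 'man'")
    return predicted

# Hypothetical patient scoring 30 at A1 and 33 at A2 (retest = 3).
print(round(predict_letter_fluency_a3(30, 33, "woman"), 2))  # 10.27 + 19.80 + 2.52
print(round(predict_letter_fluency_a3(30, 33, "man"), 2))    # same, minus 2.55
```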

External validation on the RIL-HD cohort
For each task, we determined the predictive value of models by calculating the ICC and R²e (Fig 3). Performance in the RIL-HD trial was well predicted for 14 of 20 tasks by the models developed with data for the MIG-HD cohort (R²e ≥ 0.5 and ICC ≥ 0.6).

Discussion
The design of clinical trials for neurodegenerative diseases could be improved by methodological approaches based on our knowledge of the patient's cognitive performances. However, cognitive knowledge is obtained mostly through longitudinal follow-up in observational studies, which may not include variability factors inherent to clinical trials. The retest effect may impede observations of cognitive decline in patients with Huntington's disease. We therefore assessed its impact in two long-term clinical trials in HD patients, with a short interval between the first and second assessments (MIG-HD, RIL-HD). We first determined whether there was a detectable retest effect between the first two assessments (A1 and A2), and then evaluated the impact of this effect one year later (A3). We found that replacing A1 with A2 as the baseline unmasked a decline that would not otherwise have been detected. Indeed, the comparison between A2 and A3 showed declines that were not apparent in the comparison between A1 and A3. We also modeled patient performance and showed how the inclusion of the retest effect in patient performance models would improve trial design.
At one year, decline was observed in a few cognitive tasks (SDMT, MDRS and the HVLT immediate recall), the motor task and all functional tasks. However, consistent with previous findings [2], there was a pronounced retest effect in cognitive tasks (letter fluency, SDMT, Stroop color and color/word interference, TMT A and 2- and 3-figure cancellation tasks), but no such effect in motor and functional assessments. This retest effect may hamper the objective detection of cognitive decline, with a major impact in tasks with a high cognitive demand, obscuring performance decline over a one-year period [22]. Neutralization of the retest effect is particularly important in clinical trials, because some patients may already have been exposed to testing whereas others have not, adding background noise to the overall performance data. Assuming that the retest effect is maximal at the second assessment, the use of this assessment as the baseline can decrease the impact of the retest effect on subsequent assessments. By discarding performances at A1 and using the performance measured at A2 as the baseline, we unmasked a decline in six tasks (Stroop color and color/word interference, recognition part of HVLT, TMT A and 2- and 3-figure cancellation), demonstrating the efficacy of this strategy for small samples. However, the improvement in behavioral performance [23], contrasting with the decline in other task performances, may reflect the patients' hopes and expectations of treatment.
The HVLT constitutes a specific case: we alternated parallel forms because of the strength of item recall in declarative memory tasks [24]. However, alternation was not used for other tasks, because parallel forms are of no interest for procedural tasks or tasks with a strong motor output (SDMT, TMT A and verbal fluency tasks) [25]. The use of parallel forms should also be limited because of their low intrasubject equivalence, potentially introducing noise into longitudinal performance analyses. Furthermore, the ceiling effect observed in patients with high scores in the HVLT, MDRS and TMT tasks limits the utility of neutralizing the retest effect.
However, the retest effect depends not only on the nature of the task, but also on the population assessed [26]. Indeed, Cooper et al. [27], [28] demonstrated the existence of a retest effect in categorical fluency assessment in healthy participants but not in patients with Alzheimer's disease or mild cognitive impairments. Likewise, we found no retest effect for this task in HD patients.
Figure legend. The red curve represents the baseline (reference score). The blue (or green) curve corresponds to the mean relative score one year later (A3), with A1 (or A2 for the green curve) used as the baseline. A green curve within the blue curve indicates that the decline was easier to detect if A2 was used as the baseline, rather than A1. Paired t-tests, significance: * P<0.05, ** P<0.01, *** P<0.001.

In addition to masking decline, the retest effect may provide information about disease progression [7]. This suggests that combining a strategy based on the individual performance of patients and the nature of the tasks may be useful. Indeed, the modeling of patient performance at one year for each task showed that ΔA2-A1 performance, even in the absence of a significant retest effect, accurately predicted performance for most cognitive tasks in HD and for motor and behavior tasks and TFC. ΔA2-A1 performance appears to be more frequently selected by stepwise algorithms than sociodemographic and genetic variables. We also arbitrated between parameters to strengthen our models. For example, both the number of CAG repeats and age at onset are eligible variables [29], but they are correlated [30][31][32], so only one of these factors should be included in the model [33]. We decided to include the number of CAG repeats, as age at onset is subject to some degree of subjectivity. Likewise, rather than using the performance in one task to explain performance in another task (e.g. using motor score to explain TFC [34]), we limited the set of eligible variables to demographic variables. Finally, we did not include handedness in our models, because 90% of the patients were right-handed. This approach made it possible to include a larger number of covariates in our models than in those of previous studies and to prioritize them through the selection algorithm.
For example, the number of CAG repeats has been reported to affect general verbal and spatial abilities [35], whereas our stepwise selection suggested that it was predictive of performance in the 3-figure cancellation task, which has a spatial nonverbal component. Indeed, the number of CAG repeats was found to have less impact than the sex of the patient in verbal tasks (letter and categorical fluencies) and sex was not included in the model described in the previous study. Furthermore, dichotomization of the number of CAG repeats variable (small and large numbers of repeats) may have resulted in greater importance being assigned to this variable than in models, such as ours, in which the number of CAG repeats was treated as a continuous variable. Like Ruocco [36], Kieburtz [37] and Feigin et al. [38], we showed that the number of CAG repeats improved the prediction of motor performance, but not TFC. Finally, higher education levels were associated with a better performance, for all HVLT components.
The small number of patients enrolled in the MIG-HD study is a potential limitation in the search for predictive factors for future studies. However, external validation on the RIL-HD cohort, through calculation of the intraclass correlation coefficient and the determination coefficient (R²e), demonstrated the reproducibility and robustness of our models, regardless of the differences between the two trials. Indeed, patients in the MIG-HD trial were not randomized until one year (A3), whereas those in the RIL-HD study were randomized at the second assessment (A2). Consequently, the patients in the MIG-HD study approached the intervention with greater hope, whereas those in the placebo group of the RIL-HD study may have been aware of a lack of improvement during the follow-up period. This difference may account for the poor prediction of behavioral performance in the RIL-HD study (R²e < 0). By contrast, the difference in time interval between A1 and A2 in the two studies had no impact on prediction quality, further demonstrating the validity of the models. The models were constructed with data from patients with relatively mild disease. They may, therefore, not be applicable to patients with more advanced HD. Indeed, retest effects would be expected to be smaller in patients with more severe disease.
Our findings indicate that the retest effect is a limitation in clinical trials, but that both its neutralization, through the use of a second assessment as a baseline, and its integration into task modeling would be beneficial in future trials. For example, our predictive models may facilitate the identification of rapid decliners [39], defined as individuals whose observed performance is worse than predicted. Indeed, in longitudinal clinical trials, treatment effects could be masked in such patients, as already shown for Alzheimer's disease [40]. The identification of such patients is helpful for trial design, in two ways. First, the exclusion of such patients would probably decrease intersubject variability, making it possible to decrease sample size. Second, rapid decliners could be uniformly allocated to the different arms of the study by stratified randomization, to ensure the constitution of comparable groups, in terms of both baseline data and disease progression.
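The second use can be sketched in code. The example below is an illustration under stated assumptions, not a procedure described in the trials: it flags a patient as a rapid decliner when the observed score falls below the predicted score (assuming a task in which higher scores are better), then alternates arm assignment within each stratum after shuffling. The function names, the optional `margin` parameter, the arm labels, the seed and all scores are hypothetical.

```python
import random

def flag_rapid_decliners(observed, predicted, margin=0.0):
    """True for patients whose observed score is below predicted - margin (higher = better)."""
    return [o < p - margin for o, p in zip(observed, predicted)]

def stratified_allocation(flags, arms=("arm 1", "arm 2"), seed=0):
    """Shuffle patients within each stratum, then assign arms in alternation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    allocation = [None] * len(flags)
    for stratum in (True, False):  # rapid decliners first, then the other patients
        members = [i for i, flag in enumerate(flags) if flag == stratum]
        rng.shuffle(members)
        for position, patient in enumerate(members):
            allocation[patient] = arms[position % len(arms)]
    return allocation

# Hypothetical observed and predicted scores for six patients (not trial data).
observed = [28, 35, 22, 40, 31, 19]
predicted = [30, 34, 27, 39, 30, 25]

flags = flag_rapid_decliners(observed, predicted)
allocation = stratified_allocation(flags)
for patient, (flag, arm) in enumerate(zip(flags, allocation), start=1):
    print(f"patient {patient}: rapid decliner = {flag}, allocated to {arm}")
```

Alternating within shuffled strata guarantees that the number of rapid decliners differs by at most one between two arms, which simple randomization does not ensure in small samples.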
Our findings suggest that the retest effect is detrimental, if uncontrolled, in clinical trials for neurodegenerative diseases, such as Huntington's disease. We show here that if two assessments are performed a short time apart, use of the second assessment as the baseline increases the chances of detecting an effect of treatment, if there is one. In addition, including the retest effect in models renders the resulting models more predictive, making it possible to refine the design of future trials. This constitutes a great stride forward in cognitive assessments in clinical trials.

Table. M matrix for calculating the 95% prediction interval for performance at A3 for each task. (DOCX)
S1 Text. Statistical explanation for calculation of the 95% prediction interval for performance at A3, for each task. (DOCX)