Regression to the Mean and Predictors of MRI Disease Activity in RRMS Placebo Cohorts - Is There a Place for Baseline-to-Treatment Studies in MS?

Background: Gadolinium-enhancing (GD+) lesions and T2 lesions are MRI outcomes for phase-2 treatment trials in relapsing-remitting multiple sclerosis (RRMS). Little is known about predictors of lesion development and about regression to the mean, which is an important aspect of early baseline-to-treatment trials.
Objectives: To quantify regression to the mean and identify predictors of MRI lesion development in placebo cohorts.
Methods: 21 phase-2 and phase-3 trials were identified by a systematic literature search. Random-effects meta-analyses were performed to estimate the development of T2 and GD+ lesions after 6 months (phase-2) or 2 years (phase-3). Predictors of lesion development were evaluated with mixed-effects meta-regression.
Results: The mean number of GD+ lesions per scan was similar after 6 months (1.19, 95%CI: 0.87-1.51) and 2 years (1.19, 95%CI: 1.00-1.39). 39% of the patients were without new T2 lesions after 6 months and 19% after 2 years (95%CI: 12-25%). The mean number of baseline GD+ lesions was the best predictor of new lesions after 6 months.
Conclusion: Baseline GD+ lesions predict the evolution of GD+ and T2 lesions after 6 months and might be used to control for regression-to-the-mean effects. Overall, proof-of-concept studies with a baseline-to-treatment design have to face a regression to 1.2 GD+ lesions per scan within 6 months.


Introduction
MRI-related endpoints are established outcomes for proof-of-concept and phase II efficacy trials in relapsing-remitting MS. [1] New T2-hyperintense lesions and Gd-enhancing lesions are accepted as the best available biomarkers for inflammatory disease activity. [2,3] Two different design strategies are available for early phase 2 studies. Besides small, short-term randomized placebo-controlled trials, a baseline-to-treatment design has been applied. [4,5] These studies usually analyse the reduction of new MRI lesions under treatment compared with a 6 to 12 week untreated run-in phase. They offer advantages over classic, larger placebo-controlled trials: recruitment of patients is easier because all patients receive the new treatment, sample sizes are usually smaller, and costs are within a range that allows such studies to be conducted as investigator-initiated trials. [6,7] While early placebo-controlled trials provide an initial estimate of the effect size, studies with a baseline-to-treatment design need to take regression-to-the-mean effects into account. A 40% decrease of the annualized relapse rate has been observed in placebo cohorts of phase 3 trials. [8,9] Regression to the mean of MRI endpoints, however, has only been assessed in one study of limited size [6] and has not been systematically investigated across multiple trials. Sample size considerations for both designs are determined by the assumed effect size and the event rate of the outcome. [10] Over the last 20 years, MS phase 3 trials have shown a significant increase in sample size due to lower relapse event rates. Lower relapse rates are associated with higher age in more recent studies as well as with the establishment of new diagnostic criteria. [9,11,12] Another reason for low disease activity might be selection bias, as more active patients with higher event rates tend to start one of the numerous approved treatments.
To what extent MRI endpoints for phase 2 trials share the same problems as relapse endpoints has not been investigated in depth. Increasing sample sizes and competitive recruitment raise costs and might jeopardize the feasibility of innovative and especially investigator-initiated trials (IIT).
Meta-analyses of placebo cohorts are an established method to investigate the regression-to-the-mean phenomenon and predictors of disease activity in MS. [9,11-13] Based on a systematic literature search, we aimed to quantify the regression-to-the-mean effect of MRI endpoints and to identify predictive variables that might be used as inclusion criteria in future phase 2 trials with MRI endpoints.
English-language journal publications were reviewed to identify studies that met the following criteria: (1) placebo-controlled double-blinded phase-2 or phase-3 trials in MS with a follow-up of at least 6 months, (2) MRI outcomes published, (3) exclusion of secondary-progressive MS (SPMS), primary-progressive MS (PPMS) and clinically isolated syndrome (CIS) patients and (4) published between 1980 and March 2013. Record selection, exclusion and inclusion of studies according to the PRISMA guidelines are presented in Fig. 1 and in the supporting information. [14]

Data extraction
The following data were extracted: the name of the first author, year of publication, study phase (2 or 3) and number of patients in the placebo cohort; baseline characteristics of these cohorts including mean age, mean disease duration, rate of females, mean pre-study relapse rate and mean EDSS; gadolinium-enhancing (GD+) status and whether the McDonald criteria were applied. Outcomes of interest were the mean number of new T2 lesions (newT2), the rate of patients without new T2 lesions (T2free), the mean number of GD+ lesions (meanGD) and the rate of GD+-free patients (GDfree). All outcomes were collected for two time points. For phase-2 studies we defined month 6 (±2 months) data as probably the best available. The mean number of GD+ lesions per scan over months 4-8 was used as an estimate for month 6 if single-scan data were not available. From phase-3 studies we extracted 24-month (±2 months) data and month 6 data if available. Throughout this paper, newT2, T2free, meanGD and GDfree are labelled as "outcomes", while publication dates, baseline values and definitions are referred to as "variables". For all outcomes and variables, standard deviations (SD) were also extracted if published. If not given, confidence intervals or standard errors were converted to SD. Standard errors (SE) for rates (T2free and GDfree) were calculated as proposed by Gelman and Hill. [15] Two authors (KHS and KLY) reviewed the final dataset to minimize data copying mistakes.
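The conversions mentioned in the data extraction step can be sketched as follows. This is a minimal illustration in Python, not the code used for the extraction. It assumes normal-approximation 95% confidence intervals around a mean and the usual binomial standard error for a rate, which may differ in detail from the formula taken from Gelman and Hill. All function names are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def sd_from_se(se, n):
    """SD of individual observations from the standard error of a mean."""
    return se * math.sqrt(n)


def sd_from_ci(lower, upper, n, z=Z_95):
    """SD from a symmetric normal-approximation confidence interval of a mean."""
    se = (upper - lower) / (2.0 * z)  # half-width of the CI divided by z
    return sd_from_se(se, n)


def se_of_rate(p, n):
    """Binomial standard error of a rate p observed in n patients (assumed form)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.sqrt(p * (1.0 - p) / n)
```

For example, a hypothetical cohort of 100 patients with a reported SE of 0.5 lesions corresponds to an SD of 5 lesions, and a rate of 50% lesion-free patients in the same cohort has an SE of 0.05.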

Qualitative analyses
From a conceptual point of view, heterogeneity of studies is already high due to different trial designs and inclusion criteria. Rater blinding, different MRI sequence techniques and MRI field strengths might further increase the heterogeneity of studies. The methods section of each publication was checked for the following information: sequence used for T2 lesion identification, sequence details such as slice thickness for T2/PD and T1, field strength of scanners in tesla, double-blind rating, number of raters and GD dosage. Reporting was quantified by counting the items published for each study.

Statistical methods
For all continuous variables, descriptive statistics such as mean, SD, median, and range were computed. We used random-effects meta-analyses to estimate means and 95% confidence intervals (95%CI) for each outcome. We compared the mean numbers of GD+ lesions at months 6 and 24 with an unpaired t-test and calculated the mean difference and the corresponding 95% confidence interval. The rates of GD+-free patients at both time points were compared with a chi-square test.
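As an illustration of random-effects pooling of study-level means, the following Python sketch uses the DerSimonian-Laird estimator. This is an assumption made for simplicity: the analyses reported here were run with the metafor package in R, whose estimator settings are not stated in this section. The function name is hypothetical, and the inputs are each study's mean and its sampling variance (squared standard error). The function also returns the between-study variance and the share of total variability attributed to heterogeneity.

```python
import math


def random_effects_pool(means, variances, z=1.959964):
    """DerSimonian-Laird random-effects pooling of study-level means."""
    k = len(means)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, means)) / sw

    # Cochran's Q and the method-of-moments between-study variance
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, means))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Percentage of total variability due to heterogeneity
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights, pooled mean and 95% confidence interval
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, means)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {
        "mean": pooled,
        "ci": (pooled - z * se, pooled + z * se),
        "tau2": tau2,
        "i2": i2,
    }
```

With two hypothetical studies reporting means of 1.0 and 2.0 lesions, each with a sampling variance of 0.01, the pooled mean is 1.5, the between-study variance is 0.49 and heterogeneity accounts for 98% of the total variability.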
We calculated I² (the proportion of total variability that is due to heterogeneity among true effects) and tau² (between-study variance). Outliers were detected according to Viechtbauer and Cheung. [16] Outliers were excluded from further analyses, but models with and without outliers were compared for relevant differences. All forest plots are available in the supporting information. For the mixed-effects models we included each variable separately and calculated tau² and its relative change compared to the pure random-effects model as a measure of association between variable and outcome. [11,17,18] If complete data were available from fewer than 4 studies, analysis of the variable was skipped (rule of thumb). [17,18] In addition, we tested for residual heterogeneity in the mixed-effects models. In a final step, we calculated predictive models for all outcomes with the best overall variables. To correct for multiple testing, only p-values <0.001 were considered statistically significant. All analyses were performed with the open-source software R including the Hmisc and the metafor packages. [17,19,20]
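The relative change in between-study variance that serves as the measure of association can be sketched as follows. This Python example is an illustration under stated assumptions, not the original R analysis. It uses a method-of-moments (DerSimonian-Laird type) estimator for both the intercept-only model and the model with one study-level variable, and all function names are hypothetical.

```python
import numpy as np


def mom_tau2(y, v, X):
    """Method-of-moments estimate of (residual) between-study variance."""
    y = np.asarray(y, dtype=float)
    v = np.asarray(v, dtype=float)
    X = np.asarray(X, dtype=float)
    k, p = X.shape
    W = np.diag(1.0 / v)                      # inverse-variance weights
    xtwx_inv = np.linalg.inv(X.T @ W @ X)
    beta = xtwx_inv @ X.T @ W @ y             # weighted least-squares fit
    resid = y - X @ beta
    q_e = float(resid @ W @ resid)            # residual heterogeneity statistic
    trace_p = np.trace(W) - np.trace(xtwx_inv @ X.T @ W @ W @ X)
    return max(0.0, (q_e - (k - p)) / trace_p)


def tau2_reduction(y, v, moderator):
    """tau^2 without and with one moderator, and the relative reduction."""
    y = np.asarray(y, dtype=float)
    ones = np.ones((len(y), 1))
    x_mod = np.column_stack([ones, np.asarray(moderator, dtype=float)])
    tau2_null = mom_tau2(y, v, ones)          # pure random-effects model
    tau2_mod = mom_tau2(y, v, x_mod)          # mixed-effects model
    if tau2_null == 0.0:
        return tau2_null, tau2_mod, float("nan")
    return tau2_null, tau2_mod, (tau2_null - tau2_mod) / tau2_null
```

If the study-level variable explains all between-study differences in a set of hypothetical cohorts, the residual variance drops to zero and the relative reduction is 100%.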

Results
We identified 21 published trials (10 phase-3) that met our inclusion criteria. Baseline data of phase-2 and phase-3 studies did not differ significantly, except for the number of subjects, which was included as weight in all analyses (Table 1). Only one phase-3 study presented comparable 6 and 24 month data. [36] Details about the study selection process according to the PRISMA guidelines [14] are summarized in Fig. 1 and the supporting information.
Qualitative synthesis revealed that only one study published all necessary information [24] and 6 papers (29%) did not report any of the information sought. The median number of reported items was 3 out of 7. Only the sequence for identification of T2 lesions was reported in more than 50% of the publications. Concerning rater blinding and number of raters, we could only discriminate studies that implemented 2 raters and double-blind assessments from those that did not publish details. Just 9 (43%) of the papers mentioned the field strength, which might influence lesion detection. Findings are summarized in the supporting information.
According to the above-mentioned rule of thumb, only one explorative variable per model could be investigated. Results are summarized in Table 2. We found a statistically significant inverse correlation between the baseline mean GD+ lesion number and the rate of patients without GD+ lesions after 6 months (p<0.001). The mixed-effects model did not show residual heterogeneity. GD+ lesions at baseline showed a trend towards a positive association with the number of GD+ lesions and new T2 lesions after 6 months (p = 0.03 and p = 0.01) and reduced heterogeneity by about 100%. New lesions at 6 months tended to occur more often in studies that used McDonald 2005 criteria than in those that used McDonald 2001 criteria (p = 0.002 and 0.02). For outcomes after 2 years, the rate of females was positively correlated with mean GD+ lesions (p<0.001) and lower baseline EDSS was predictive of patients without GD+ lesions after 2 years. None of the investigated variables was predictive of the number of new T2 lesions after 2 years. Overall, the number of GD+ lesions at baseline showed the best association with month 6 outcomes and was chosen for calculation of predictive models (Fig. 4). The mean number of GD+ lesions after 6 months can be estimated with the formula: $\text{GD+}_{\text{month 6}} = 0.455 + 0.551 \times \text{GD+}_{\text{baseline}}$.
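The predictive formula can be applied as in the following Python sketch. The coefficients are those reported above. The function names are hypothetical, and because the model was fitted to cohort means, the sketch is meant for cohort-level planning rather than individual patients. The second function is only an algebraic consequence of the formula: it returns the baseline mean at which the predicted month 6 value equals the baseline value.

```python
INTERCEPT = 0.455  # reported intercept of the month 6 model
SLOPE = 0.551      # reported coefficient for baseline GD+ lesions


def predict_gd_month6(gd_baseline, intercept=INTERCEPT, slope=SLOPE):
    """Expected mean number of GD+ lesions at month 6 for a placebo cohort."""
    if gd_baseline < 0:
        raise ValueError("mean lesion count cannot be negative")
    return intercept + slope * gd_baseline


def no_change_baseline(intercept=INTERCEPT, slope=SLOPE):
    """Baseline mean at which the predicted month 6 mean equals the baseline."""
    return intercept / (1.0 - slope)


# Hypothetical cohort with a mean of 2.0 GD+ lesions at baseline
expected = predict_gd_month6(2.0)
relative_drop = (2.0 - expected) / 2.0
```

For the hypothetical cohort with a baseline mean of 2.0 GD+ lesions, the formula gives an expected month 6 mean of about 1.56, a decline of roughly 22% without any treatment. Under the formula, cohorts with a baseline mean above about 1.0 are expected to decline.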

Discussion
A mean number of 1.2 GD+ lesions per scan might be expected in placebo cohorts of RRMS after 6 months as well as after 2 years of follow-up. Regression to the mean seems to occur already in the first months of study participation and might be negligible after 6 months. This is in line with previous findings from a small study that found a regression to 1.2 GD+ lesions within 6 months. [6] These findings provide evidence that baseline-to-treatment designs are feasible if carefully interpreted with regard to regression-to-the-mean effects. The relevance of this kind of study has been shown by the development of BG-12 as an MS treatment, based on an investigator-initiated baseline-to-treatment study, which has now led to market approval. [5] Comparison of phase 2 and phase 3 studies is possible, as we could not detect a significant difference in key baseline parameters between the two study sets. However, due to the low number of studies, the 95% confidence interval for the mean difference is still wide (-0.39 to 0.39). We could quantify the overall amount of regression to the mean as 37%. This is similar to previous analyses of regression-to-the-mean effects on relapse rates. [8,9] Corrected for different numbers of baseline GD+ lesions in different studies, the effect was smaller (about 16%). In contrast to the well-known observation that annualized relapse rates in RRMS trials decreased from earlier to more recent treatment studies [9,11,12], we could not detect a similar pattern for MRI lesions. Two trials published in 2012 even had the highest numbers of new T2 lesions after 2 years [38,40], and the number of GD+ lesions per scan did not correlate with publication date. This contradicts the previously shown association between GD+ lesions and relapses. [3] One reason might be that sensitivity for lesion detection increased with new MRI technologies such as 3D sequences or higher field strength, compensating for an opposite effect of new diagnostic criteria.
Unfortunately, our restricted data set did not allow us to evaluate the association between different MRI methods and lesion counts. We observed a lower number of GD+-free patients after 6 months than after 2 years. This discrepancy might be explained by a long-term separation of RRMS patients into two groups. About half of the patients are free from acute GD+ inflammatory activity after 2 years. The other half must have ongoing high inflammatory activity with more than 2 active lesions per scan to explain an overall mean of 1.2 lesions per scan.
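The arithmetic behind the two-group argument can be made explicit with a short Python sketch. It assumes, as a simplification, that lesion-free patients contribute exactly zero lesions per scan. The function name is hypothetical.

```python
def mean_in_active_group(overall_mean, fraction_lesion_free):
    """Mean lesions per scan among patients who are not lesion-free.

    Follows from: overall_mean = (1 - fraction_lesion_free) * active_mean,
    because the lesion-free fraction contributes zero lesions.
    """
    if not 0.0 <= fraction_lesion_free < 1.0:
        raise ValueError("fraction_lesion_free must be in [0, 1)")
    return overall_mean / (1.0 - fraction_lesion_free)


# About half of the patients lesion-free, overall mean of 1.2 lesions per scan
active_mean = mean_in_active_group(1.2, 0.5)
```

With half of the patients lesion-free and an overall mean of 1.2 lesions per scan, the remaining half must average 2.4 lesions per scan, consistent with the statement of more than 2 active lesions per scan.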
Only baseline GD+ lesions were predictive of 6-month outcomes and reduced the between-study variance to a non-significant level. After two years, this association is lost. This might explain why the number of baseline GD+ lesions was not predictive of the annualized relapse rate. [11] In addition, it fits natural history cohort data that could not confirm a predictive value of GD-enhancing lesions for disability. [42] GD+ lesions are a good predictor of short-term MRI disease activity and hence a valuable inclusion criterion for phase II trials, but perhaps not for phase III trials. Furthermore, they can be used to estimate GD+ lesions after 6 months in baseline-to-treatment designs.
Newer diagnostic criteria tend to diagnose more patients with low inflammation, as has been shown for relapse outcomes. [11,43] Our data now show an opposite trend, as the change from McDonald 2001 to 2005 criteria was associated with an increased number of new T2 lesions after 6 months. In contrast to previous meta-analyses, baseline EDSS was predictive of only a single outcome: the number of GD+ lesions after 2 years. [11,12] Even though we adjusted our analyses for multiple testing by using a conservative p-value threshold of 0.001, this must be confirmed in future work based on more trials or individual case data. The association of inflammatory outcomes with sex and diagnostic criteria is more in line with previous studies. [11,12] However, meta-analytic techniques cannot clarify whether MRI disease activity predicts or correlates with relapse rate or disease progression. Only individual case data might answer this question and give information about possible predictors of ongoing inflammatory disease activity.
Compared to meta-analyses addressing clinical endpoints such as relapses, our research is probably more affected by random effects. Phase-3 trials show relevant heterogeneity due to different eligibility criteria, different countries and different relapse definitions. Inclusion of phase 2 studies with small sample sizes already increases variability, and addressing MRI endpoints will probably increase it further. Due to methodological and technical innovation, it is not clear how comparable T2 lesions from 1994 and 2010 are. Nevertheless, we believe detection of T2 and GD+ lesions was robust enough to be compared using random-effects meta-analytic techniques. Other MRI outcomes such as lesion volume, brain atrophy or even more advanced techniques such as diffusion tensor imaging measures could not be included. Up to now, reliable clinical or MRI outcomes of disability are still lacking. [1,44] Besides new outcome development, novel trial designs may help to reduce sample size and follow-up time. [45-47] This is especially important for investigator-driven research, as those trials face increasingly competitive recruitment. An increasing number of new therapeutics and a decreasing general disease activity threaten the testing of new treatment approaches. Baseline-to-treatment studies might therefore still, and maybe even increasingly, be the most feasible approach in terms of recruitment and effort for academic-led treatment research.

Results of mixed-effects models with at least 4 studies. Outcomes are bold. Coefficients indicate positive or negative association, *(p<0.001). tau² is the estimate of residual heterogeneity compared to the simple random-effects model; § indicates no significant residual heterogeneity in the mixed-effects model (p>0.05). A higher reduction of tau² within the mixed-effects models indicates a stronger association with the outcome. CI = Confidence interval.

Conclusion
The baseline number of GD+ lesions is the best predictor of the evolution of GD+ and T2 lesions after 6 months and might be used to control for regression-to-the-mean effects. Overall, proof-of-concept studies with a baseline-to-treatment design have to face a regression to 1.2 GD+ lesions per scan within 6 months.