Reliability in long-term clinical studies of disease-modifying therapies for relapsing-remitting multiple sclerosis: A systematic review

Background Although relapsing-remitting multiple sclerosis (RRMS) has a chronic course, little information is known about the comparison between the disease-modifying therapies (DMT) for long-term outcomes. We aimed to conduct a systematic review of randomized clinical trial (RCT) extension and observational studies to examine the efficacy and safety of all available DMT for RRMS, compare the evidence with that derived from mid-term studies, and investigate whether the published long-term data are robust and reliable enough to inform clinical decision-making concerning RRMS treatment. Method PubMed, Scopus, and manual searches were performed until October 2019. The clinical outcomes of long- and mid-term studies were compared. ROBINS-I was used to assess the methodological qualities of the long-term studies. PROSPERO number CRD42019123361. Results Nineteen long-term studies (9,018 participants) were included in the systematic review. All studies presented serious or critical risks of bias that were mainly due to confounding, selection, and missing data biases. The annualised relapse rates (ARR) observed in the long-term studies are lower (better) than those from the mid-term studies for most treatments. The main reason for this ARR decrease could be a selection bias for good responders in the long-term studies, since many studies show a loss of patients between the mid- and long-term phases. The safety profiles depend on the study, follow-up, report, and outcome (i.e., discontinuation or number of patients with at least one serious adverse event). Conclusion The currently available long-term data for patients with RRMS exhibit serious or critical risks of bias that preclude robust comparisons between long-term studies. High quality comparative observational studies with long-term follow-ups or RCT extensions with intention-to-treat analyses are needed to support clinical and regulatory practice. 
Until reliable long-term evidence is available, neurologists should continue to base their conduct on mid-term studies, the patient's experience and, most importantly, the patient's needs and predictor factors, according to personalized medicine.

gov) and the reference lists of reviews and included studies were also searched. Complete search strategies are provided in S2 Table in S1 Appendix.
We included studies that fulfilled the following inclusion criteria according to the PICOS acronym:

Population
Patients aged 18 years and older diagnosed with RRMS; studies evaluating RRMS with other forms of multiple sclerosis (i.e., clinically isolated syndrome, primary progressive multiple sclerosis, or secondary progressive multiple sclerosis) were excluded.

Outcomes
Annualised relapse rate (ARR, which is the primary outcome of most of the mid-term studies [6]), discontinuation due to adverse events (DAE), and the number of patients with at least one serious adverse event (SAE).

Studies
Prospective or retrospective comparative cohort studies, randomised phase II or later controlled trials (including post-hoc analyses), and multi- or single-arm extensions of RCTs with at least 36 months of follow-up. Equivalence studies were excluded.
For studies that evaluated a switch in therapy, we included only the arms with at least 36 months of continuing follow-up. Studies that considered at least one of the aforementioned outcomes were included.
Two researchers independently screened the titles and abstracts of the retrieved studies to identify irrelevant records. In a second stage, full-text articles were also independently evaluated by two researchers according to the inclusion and exclusion criteria. Discrepancies were reconciled in consensus meetings using a third researcher as a referee.
The following data were independently extracted by two researchers: (i) study characteristics (authors' names, year of publication, trial design, sample size, evaluated DMT, mean follow-up, diagnostic criteria, and sponsor); (ii) baseline data (patients' sex and age, disease duration, or symptoms onset); and (iii) clinical outcomes.
The baseline data and clinical outcomes of the long-term studies (≥ 36 months) were compared with those from mid-term studies (> 3 and < 36 months) that were taken from a recently published systematic review [6]. The data were tabulated according to the ARR and standard deviation; when a study reported the confidence interval, it was converted to a standard deviation. The DAE and SAE were reported according to the number of patients with the outcome, sample size, and percentage.
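The conversion from a confidence interval to a standard deviation can be sketched as follows. The review does not state which formula was used, so this is only an illustration under stated assumptions: a two-sided 95% interval around a mean, a normal approximation (z = 1.96), and a known group size n. The function name `sd_from_ci` and the example values are hypothetical. Intervals for rates are sometimes computed on a log scale, in which case this simple form would not apply directly.

```python
import math


def sd_from_ci(lower, upper, n, z=1.96):
    """Approximate the standard deviation from a two-sided confidence interval.

    Assumes the interval was built as mean +/- z * SE with SE = SD / sqrt(n),
    i.e. a normal approximation around a mean (an assumption, not the
    review's documented method).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if upper < lower:
        raise ValueError("upper bound must not be below lower bound")
    standard_error = (upper - lower) / (2 * z)
    return standard_error * math.sqrt(n)


# Hypothetical example: a 95% CI of 0.20 to 0.40 in a group of 100 patients.
example_sd = sd_from_ci(0.20, 0.40, 100)
print(round(example_sd, 3))
```

For small groups, a t-distribution quantile would usually replace 1.96; it can be passed through the `z` parameter.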
The critical evaluations of the risks of bias of the studies were conducted by two independent reviewers using the Risk of Bias in Non-randomised Studies of Interventions (ROBINS-I) tool [11]. In the absence of consensus, points of disagreement were resolved by the opinion of a third researcher. The risks of bias of the mid-term studies were assessed using the Cochrane Collaboration revised Risk of Bias assessment tool [12], and the results have been published in a previous systematic review [6].

Results
Our systematic review identified 1,760 records in the electronic databases after duplicate removal and obtained two by manual search. Of these, 1,699 were considered irrelevant during the screening, and 38 were excluded during the full-text appraisal (Fig 1 and S3 Table in S1 Appendix). The remaining 25 records (19 studies) comprised 14 RCTs and five observational studies, and were included in the qualitative synthesis (S4 Table and S7 Table in S1 Appendix). The articles were published between 2003 and 2018. In total, 9,018 participants (median: 147; interquartile range: 83-249) were included, and 5,468 (60%) were women (three studies did not report the proportion of patients of each sex). Five studies (26%) evaluated a switch in therapy. Altogether, 14 dosages of DMT were identified. Six studies (32%) compared active therapies (head-to-head), seven (37%) compared doses, five (26%) were non-comparative, and one (5%) evaluated the active treatment against no treatment. No studies assessing natalizumab, ocrelizumab, or teriflunomide fulfilled the inclusion criteria.
A qualitative comparison of the mid- and long-term baseline data revealed that they were very similar, except for ATTAIN, which included a population with relapses within the previous 2 years (Table 1; additional characteristics are presented in S4 Table in S1 Appendix).
Additionally, 28 mid-term RCTs were considered for comparison, and their characteristics have been previously reported [6]. In summary, the mid-term articles were published between 1995 and 2018, with a median of 2011. Most of the studies either included both treatment-naïve and treatment-experienced patients (12; 40%) or did not report this information (11; 38%); 6 (20%) included only treatment-naïve participants, and 1 (3%) assessed only treatment-experienced patients. Most of the studies had a follow-up of 2 years (median 2; interquartile range: 1-2).
The methodological qualities of the long-term studies are presented in S5 Table in S1 Appendix. All studies were found to have serious or critical methodological problems. The non-comparative RCT extension studies were all deemed to have critical risks of bias because the lack of a comparison group automatically precludes the comparability of such a study to an RCT (the gold standard), and the ROBINS-I questions assess comparability between groups, whether concerning baseline characteristics or patient follow-up. All comparative RCT extensions and cohort studies presented with serious risks of bias, and the following domains were primarily responsible for these classifications: 'bias due to confounding factors', 'selection bias', and 'missing data bias'. Most of the studies did not report any attempt to control for key confounders (e.g., by adjusting the analyses), which limits the comparability between arms. Most studies enrolled into the extension phase only the patients who tolerated the drug and did not discontinue the treatment during the core study. Many studies also did not report any handling of missing data, although dropout rates ranged from 0% to 83%.
The methodological qualities of the included mid-term studies were recently published [6]. In summary, most of the studies presented a 'low risk of bias' (58%), which was followed by 'some concerns' (25%). The domain that most frequently scored as a 'high risk of bias' was the measurement of the outcome (due to the lack of the masking of the assessors).
The safety scenario was less consistent; the different safety profiles depended on the study, outcome evaluated (discontinuation or the number of patients with at least one serious adverse event), follow-up time, and outcome measure or report. The annual incidences of DAE and SAE were reported by 5 and 2 long-term studies, respectively, and the numbers of patients who presented with an event of DAE and SAE in the complete follow-up were reported by 8 and 9 long-term studies, respectively. The proportions of events were similar between the different treatment studies, but ALE12 and FING1.25QD (unapproved dose) exhibited reduced DAE from the mid-term to the long-term endpoints. Regarding SAE, CLA3.5 reported an increased proportion from the mid-term to the long-term (S6 Table in S1 Appendix).

Discussion
We investigated the long-term effects of DMT in RRMS through a systematic review of 19 studies (9,018 participants). Recent network meta-analyses (NMAs) of DMT in RRMS [5,7,13] have been limited to RCTs that have reported only short- (< 3 months) and mid-term (> 3 and < 36 months) outcomes. In our study, we aimed to summarise the clinical outcomes of DMT more comprehensively by extending the follow-up to capture the comparative effects reported in long-term studies, and we demonstrated the limited value of these studies for supporting clinical decision-making and practice guidelines. The comparison of mid-term RCTs (i.e., the gold standard) with long-term RCT extensions and observational studies (i.e., real-world data) aims to identify potential differences in outcomes that could be explained by population differences. Although it would be useful to have strong evidence about the long-term outcomes of DMT, our findings highlight the importance of caution when using RCT extensions and observational studies to support clinical practice, because their important limitations can compromise the validity of their evidence. Despite these limitations, some multiple sclerosis treatment guidelines consider evidence extracted from mid- and long-term studies, including extension studies, to support their recommendations [14]. Although neurologists who are MS experts base their conduct on the patient's experience or on personalized medicine (i.e., the patient's needs and predictor factors) [15,16], neurologists who are not MS experts have limited evidence to support decision-making, whether from clinical trials, observational studies, or guidelines.
Thus far, there is no consensus regarding whether an RCT extension is an observational or an interventional study. The literature tends to classify these studies as observational [17,18] because they do not start a new therapy and, more importantly, because, with the exception of CLARITY Extension [19], no appropriate randomisation takes place at the beginning of the extension phase. Randomisation and masking are essential characteristics that guarantee the superiority of RCTs, but they are lost during an extension phase [18,20,21]. Thus, we decided to evaluate both cohort and RCT extension studies using the ROBINS-I tool in our systematic review. Our position agrees with that of the FREEDOMS researchers, who registered an RCT extension as an observational study in ClinicalTrials.gov [22]. Unfortunately, other RCT extension studies that were included in our systematic review were registered as interventional or only cited the NCT number of the original RCT [19,23-29]. Notably, even if RCT extensions were considered interventional studies, their methodological quality, as assessed with a tool for RCT assessment, would be classified as high risk of bias because of the lack of randomisation, the assessors' awareness of the therapy, missing data, and the loss of comparability when only one arm is followed. The number of extension studies has increased in the last decade despite the lack of standardisation of their methodological quality, which compromises their reliability to inform clinical practice.
Table 2. Comparison between mid- and long-term annualised relapse rate. [Table body not recovered; column headings: 3- to 12-month, 24-month, 36- to 48-month, ≥ 60-month; cells report ARR (SD) [n] or ARR (SD) [n-% of patients from the original study].]
The loss of randomisation is a special concern in long-term studies because the patients who enter the extension phase belong to a selected group that could tolerate [20] and positively respond to the therapy during the original RCT [21]. The ATTAIN study is a good example: it reported relapse frequencies within the last 2 years of 0.36 and 0.45 for the PIFN125Q2W and PIFN125Q4W groups, respectively, which are well below the ARRs reported for other DMTs. The confounding bias domain in ROBINS-I assesses how a study deals with a lack of randomisation by adjusting for potential confounders, which is rarely performed. In observational cohort studies, adjusting for potential confounders is more frequent, but this was not the case in the majority of the studies included in our systematic review. Another concern arising from the observational design or the extension of a clinical trial is the absence or loss of blinding of patients and assessors. In RRMS, the absence of blinding can be critical, since the main clinical efficacy outcomes are related to relapse, which is a subjective outcome given the range of different definitions of relapse. For example, some authors require that a relapse last at least 24 hours [30], others 48 hours [31]; some authors define a relapse as an increase of ≥ 1 point in two functional system scores (FSS) or ≥ 2 points in one FSS [32], while others define it as an increase of ≥ 1 point in the Expanded Disability Status Scale (EDSS) score if the previous EDSS score was ≤ 5.5, or of ≥ 0.5 points if the previous EDSS score was ≥ 6 [33].
Thus, these discrepancies between the definitions show how relapse can be considered a subjective outcome and, therefore, how the patient's or assessor's awareness of the therapy can influence the assessment, contributing to different ARR results between mid- and long-term studies for the same DMT.
The lack of adjustment for covariates (which an observational study must guarantee) and the lack of blinding and of maintained randomisation (which an experimental study must guarantee) may be among the reasons for the lower (i.e., better) ARRs reported in several of the long-term studies compared with their mid-term predecessors. For example, PRISMS presented ARRs of 1.82 in the mid-term study and 0.83 in the long-term phase. The main reason for this unexpected decrease could be a selection bias for good responders in the long-term studies after 12% of the patients were lost between the mid- and long-term phases. In our systematic review, a quarter of the studies had a dropout rate above 20% before the beginning of the extension phase. Hemming et al. proposed the use of intention-to-treat analysis with respect to the baseline group of patients entering an RCT; i.e., each patient should be treated as a responder or non-responder depending on the reason for not continuing in the extension study [34].
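The selection effect described above can be illustrated with a small calculation. This is a sketch with entirely hypothetical numbers, not data from any included study: the ARR is taken as total relapses divided by total patient-years, and the rule that only patients with at most two relapses continue into the extension is an assumption made for illustration.

```python
def annualised_relapse_rate(relapses, patient_years):
    """ARR as total relapses divided by total patient-years of follow-up."""
    total_years = sum(patient_years)
    if total_years <= 0:
        raise ValueError("total follow-up must be positive")
    return sum(relapses) / total_years


# Hypothetical core study: 10 patients, each followed for 2 years.
core_relapses = [0, 0, 1, 1, 2, 2, 3, 4, 4, 5]
core_years = [2.0] * len(core_relapses)

# Assumed selection rule: only patients with at most 2 relapses in the
# core study continue into the extension phase ("good responders").
continuers = [i for i, r in enumerate(core_relapses) if r <= 2]

arr_all_randomised = annualised_relapse_rate(core_relapses, core_years)
arr_continuers_only = annualised_relapse_rate(
    [core_relapses[i] for i in continuers],
    [core_years[i] for i in continuers],
)

print(arr_all_randomised)   # 22 relapses over 20 patient-years
print(arr_continuers_only)  # 6 relapses over 12 patient-years
```

With the same core-phase data, restricting the analysis to the continuing subgroup lowers the ARR without any change in treatment effect, which is why an analysis anchored on the originally randomised group matters.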
Another potential reason for discrepancies in the ARR reported for the same DMT among several studies is the lack of a common adjustment: while some studies adjust the ARR for EDSS [35], others adjust for age [36] or sex, or report an unadjusted analysis [37]. Additionally, some studies model the data with negative binomial regression [38] and others with Poisson regression [39].
Comparing mid- and long-term results across different studies is further compromised by the variability in the starting point (i.e., the mid-term results). For example, IFNA22TIW presented an ARR that ranged from 0.70 in the DMSG study to 1.82 in the PRISMS study, and IFNA44TIW presented an ARR lower than 0.55 in all studies except PRISMS, which reported an ARR of 1.73. This important variability in efficacy in the mid-term studies can be explained by several differences in how these studies were conducted: DMSG is a study with a high risk of bias, while PRISMS is a study with a low risk of bias; PRISMS is an old study that used Poser's 1983 diagnostic criteria, whereas most studies assessing IFNA44TIW used the McDonald criteria (2001 to 2010), which are more sensitive and allow for earlier diagnosis and might therefore have resulted in different characteristics of the included patients [40]. Differences in the proportions of patients with highly active or rapidly evolving severe conditions in the mid- and long-term studies could also explain these discrepancies, but most of the mid- and long-term studies did not report this information, since the terms 'highly active' and 'rapidly evolving severe' RRMS have come into wider use only in the last decade [41,42].
Differences in the risk of bias and population can also limit the comparability between long-term studies. For example, CARE-MS II (≥ 60 months) included only treatment-experienced patients and reported an ARR for ALE12 of 0.21, while CAMMS223 (≥ 60 months) included only treatment-naïve patients and reported an ARR for ALE12 of 0.12.
We identified a consistently higher proportion of patients with at least one SAE in the long-term compared with the mid-term studies. However, this result could be misleading because most studies did not define SAE and could have counted multiple sclerosis relapses as an SAE instead of a therapeutic failure [43]. Another issue precluding the comparison between studies is the inconsistent manner of reporting safety outcomes. Some studies reported the annual incidence, while others reported the proportion of patients presenting with an event during the complete follow-up period. For example, CARE-MS II reported the number of patients who discontinued ALE12 each year together with the number of patients who continued ALE12 therapy each year. However, CAMMS223 reported only that five patients discontinued due to adverse events over the five years of study. For CAMMS223, the incidence rate of adverse events cannot be calculated. The only studies that reported safety outcomes as incidence per patient-year were CARE-MS I and II, and GALA.
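The difference between the two reporting styles can be sketched as follows. All numbers are hypothetical and the function names are illustrative. The sketch assumes each patient still on therapy in a given year contributes roughly one patient-year, which is a simplification of a proper person-time calculation.

```python
def incidence_per_patient_year(events_by_year, on_therapy_by_year):
    """Incidence rate from yearly counts.

    Each patient on therapy in a given year is assumed to contribute
    about one patient-year (a simplifying assumption).
    """
    if len(events_by_year) != len(on_therapy_by_year):
        raise ValueError("yearly lists must have the same length")
    patient_years = sum(on_therapy_by_year)
    if patient_years <= 0:
        raise ValueError("patient-years must be positive")
    return sum(events_by_year) / patient_years


def cumulative_proportion(total_events, patients_enrolled):
    """Share of enrolled patients with an event over the whole follow-up."""
    if patients_enrolled <= 0:
        raise ValueError("patients_enrolled must be positive")
    return total_events / patients_enrolled


# Hypothetical study A reports yearly discontinuations and patients on therapy.
discontinuations = [4, 3, 2, 1, 0]
on_therapy = [100, 96, 93, 91, 90]
rate_a = incidence_per_patient_year(discontinuations, on_therapy)

# Hypothetical study B reports only a total over five years, so only a
# cumulative proportion is available; without patient-years no rate follows.
proportion_b = cumulative_proportion(10, 100)

print(round(rate_a, 4), proportion_b)
```

The two quantities are on different scales (events per patient-year versus share of patients over the full follow-up), so they cannot be compared directly across studies.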
One limitation of our study, as with any systematic search, is that missing studies could exist. However, a grey literature search found no additional studies, and only one additional study was found through the manual searches. These findings reinforce the quality of our search. We were unable to perform meta-analyses of the long-term outcomes because of the poor reporting of these outcomes in primary studies of DMT in RRMS.
In conclusion, the currently available evidence regarding long-term safety and efficacy outcomes cannot sufficiently contribute to clinical decision-making in patients with advanced RRMS because the studies have critical or serious risks of bias due to the inclusion of a selected population composed of good responders in both efficacy and safety. High-quality comparative observational studies with long-term follow-ups or RCT extensions with intention-to-treat analyses are needed to support clinical and regulatory practice. Until reliable long-term evidence is available, neurologists should continue to base their conduct on mid-term studies, the patient's experience in terms of effectiveness and safety and, most importantly, the patient's needs and predictor factors, according to personalized medicine.