Room for Improvement in Conducting and Reporting Non-Inferiority Randomized Controlled Trials on Drugs: A Systematic Review

Background A non-inferiority (NI) trial is intended to show that the effect of a new treatment is not worse than the comparator. We conducted a review to identify how NI trials were conducted and reported, and whether the standard requirements from the guidelines were followed. Methodology and Principal Findings From 300 randomly selected articles on NI trials registered in PubMed at 5 February 2009, we included 227 NI articles that referred to 232 trials. We excluded studies on bioequivalence, trials on healthy volunteers, non-drug trials, and articles of which the full-text version could not be retrieved. A large proportion of trials (34.0%) did not use blinding. The NI margin was reported in 97.8% of the trials, but only 45.7% of the trials reported the method to determine the margin. Most of the trials used either intention to treat (ITT) (34.9%) or per-protocol (PP) analysis (19.4%), while 41.8% of the trials used both methods. Less than 10% of the trials included a placebo arm to confirm the efficacy of the new drug and active comparator against placebo, and less than 5.0% were reporting the similarity of the current trial with the previous comparator's trials. In general, no difference was seen in the quality of reporting before and after the release of the CONSORT statement extension 2006 or between the high-impact and low-impact journals. Conclusion The conduct and reporting of NI trials can be improved, particularly in terms of maximizing the use of blinding, the use of both ITT and PP analysis, reporting the similarity with the previous comparator's trials to guarantee a valid constancy assumption, and most importantly reporting the method to determine the NI margin.


Introduction
In the drug development process, a randomized controlled trial (RCT) can have a superiority, equivalence or non-inferiority design. A superiority trial aims to demonstrate the superiority of a new therapy compared to an active comparator or a placebo, while an equivalence trial aims to demonstrate that a new therapy is equivalent (within margins) to its active comparator. In non-inferiority (NI) trials, the aim is to show that the new treatment is not worse than the comparator, which typically is an active drug.
NI trials can be used when a new drug is considered to have a similar efficacy profile to its comparator but may offer other advantages over the existing drug, such as a novel method of administration or a better safety profile. In a regulatory setting, NI trials can be used to provide primary, but indirect, evidence of efficacy of a novel drug in cases where a placebo control treatment is not ethically justified. [1,2] Critics have pointed at various drawbacks of NI trials, questioning whether they are really useful. Some argue that NI trials benefit only the pharmaceutical industry, as they allow drugs without additional clinical efficacy to enter the market. [3,4] However, as argued by Jones et al., in some cases the new treatment may have no direct advantage but may present an alternative or second-line therapy. [5] From a methodological perspective, compared to superiority trials, NI trials have issues in design and analysis that can influence proper inference. First, the value of blinding in NI trials is under debate, especially if the endpoints are subjective. [6] In a superiority trial, a blinded investigator who has a preliminary belief in the superiority of the test drug cannot manipulate the results to support this belief. In contrast, in an NI trial, a blinded investigator with a preliminary belief in the non-inferiority of the test drug can bias the result by assigning similar ratings to the treatment responses of all patients. Others have argued that blinding is still important to show the differences between drugs in NI trials. [7] Second, there are different methods to determine the NI margin, and there are debates on whether the margin should be determined based on statistical or clinical considerations.
Third, although there is a degree of consensus that non-inferiority should be shown for both the intention-to-treat (ITT) and per-protocol (PP) analysis sets, it is not clear whether this will be conservative or anti-conservative in a particular situation. [6,7] Fourth, a difficulty in interpreting NI trials is their lack of ability to distinguish an effective drug from an ineffective drug i.e. assay sensitivity [7,8], without relying on evidence outside the trial. A drug is considered effective if it shows a significant treatment effect compared with placebo. An additional placebo arm is recommended to confirm assay sensitivity [2,6,9]. However, this is often impossible due to ethical reasons. Last, the validity of the historical data that was used as the reference for the current trial, i.e. constancy assumption, is a critical point in the interpretation of NI trials. Related to the last issue, the CONSORT statement has recommended authors to mention whether the eligibility criteria, interventions and outcomes are identical or very similar to any trial that established efficacy of the reference treatment. [10] The effort is encouraged to support the validity of the constancy assumption.
The ICH E9 [11], the ICH E10 [8], the CHMP guidelines [12] and the extension of the CONSORT statement on NI trials [10] are the currently available guidelines for the appropriate conduct and reporting of NI trials. We summarized the guidelines' recommendations on the five methodological issues described above in Table 1. Furthermore, we included the FDA draft guideline on NI trials [13] in Table 1 for consideration. The draft FDA guideline is not yet in effect and is still open to changes (as of 18 March 2010).
In this review, we described how published NI trials were conducted and reported, and whether the standard requirements from the guidelines were followed.

Search strategy and publication selection
We performed searches for NI trials in PubMed on 5 February 2009 and retrieved 669 articles as described in Figure 1. Subsequently, based on pragmatic consideration rather than formal sample size calculation, we used SPSS 16 to select a random sample of 300 articles. We subsequently excluded study design papers, reviews, trials using healthy volunteers, non-drug trials, non-RCTs, and articles of which the full-text version could not be retrieved. If one article reported multiple trials, we analyzed the trials separately. If multiple articles reported the result of one trial, we considered them as one subject, and included only the first publication.
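The random selection step can be illustrated with a short sketch. The review used SPSS 16 for this step; the Python below is only a stand-in that shows the same idea (a simple random sample without replacement). The identifiers, the seed value and the function name `draw_random_sample` are hypothetical.

```python
import random

def draw_random_sample(article_ids, sample_size, seed):
    """Return a reproducible simple random sample without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated and audited
    return sorted(rng.sample(article_ids, sample_size))

# Hypothetical identifiers standing in for the 669 retrieved PubMed records.
retrieved = list(range(1, 670))
sample = draw_random_sample(retrieved, 300, seed=2009)
print(len(sample), len(set(sample)))  # number of sampled and of distinct articles
```

Recording the seed alongside the search date would allow the same 300 articles to be redrawn later.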

Data extraction
To extract relevant data, we created a standardized data extraction form, accompanied by an operational definition of each extracted variable. GW extracted all articles and MK extracted a randomly chosen 10% of the articles. GW and MK then compared the extraction results from the 10% articles. Disagreements occurred in seven articles and in three of 38 variables. The cause of the disagreements was the interpretation on vague information listed in the articles. We then decided that only a literal extraction was allowed, thus disallowing interpretation during extraction. For example for the degree of blinding, if only the description on how the investigator did the blinding but no clear terms e.g. double blind were written in the articles, we categorized it as 'ambiguously stated'. We then updated the operational definition accordingly and GW rechecked the extraction results of those three variables in all the articles again and if necessary revised them.
For any missing information, if the articles referred to a registration database or a previous paper for a full description of the methods, information from these sources was retrieved.

Table 1 (excerpt). Guideline recommendations.

NI margin
- An acceptable non-inferiority margin should be defined (ICH E10, CPMP/EMEA 2000).
- Should be pre-specified, and can be no larger than the presumed entire effect of the active control in the NI trial (draft FDA guideline on NI trials, 2010).
- Should be specified in the publication (CONSORT statement extension, 2006).

Method to determine NI margin
- The determination of the margin in a non-inferiority trial is based on both statistical reasoning and clinical judgment (ICH E10).
- The margin is chosen by defining the largest difference that is clinically acceptable, so that a difference bigger than this would matter in practice (CPMP/EMEA 2000).
- The NI margin should generally be identified based on previous experience in placebo-controlled trials of adequate design under conditions similar to those planned for the current trial, but could also be supported by dose-response or active-control superiority studies (ICH E10, CHMP/EMEA 2005).
- The fixed margin method (two CIs method) is recommended. It is referred to as fixed because the past studies comparing the drug with placebo are used to derive a single fixed value for the statistical margin, even though this value is based on results of placebo-controlled trials (one or multiple trials versus placebo) that have a point estimate and confidence interval for the comparison with placebo. This approach is relatively conservative, as it keeps separate the variability of estimates of the treatment effect in the historical studies and the variability observed in the NI trial, and uses a fixed value for the estimate of the control effect based on historical data (the 90% or 95% CI lower bound), a relatively conservative estimate of the control drug effect.
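The fixed margin idea described above can be sketched numerically. This is a minimal illustration under stated assumptions, not a procedure taken from any guideline text: the historical effect is expressed as control minus placebo with larger values being better, a normal approximation is used for the confidence bound, and the names `fixed_margin`, `m1`, `m2` and `preserved_fraction` as well as the input numbers are hypothetical.

```python
Z_95 = 1.959964  # normal quantile for a two-sided 95% confidence interval

def fixed_margin(hist_effect, hist_se, preserved_fraction=0.5, z=Z_95):
    """Derive a fixed NI margin from a historical control-versus-placebo effect.

    hist_effect: point estimate of control minus placebo (larger is better).
    hist_se:     standard error of that estimate.
    """
    m1 = hist_effect - z * hist_se  # conservative control effect: lower CI bound
    if m1 <= 0:
        raise ValueError("historical data do not show a control effect")
    m2 = (1.0 - preserved_fraction) * m1  # largest loss of effect tolerated
    return m1, m2

# Hypothetical historical estimate: 20 percentage points, standard error 4 points.
m1, m2 = fixed_margin(hist_effect=0.20, hist_se=0.04, preserved_fraction=0.5)
print(round(m1, 3), round(m2, 3))
```

Because the lower confidence bound is frozen into a single number before the NI trial is analyzed, the uncertainty of the historical estimate and that of the NI trial are kept separate, which is what makes the approach conservative.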
Characteristics of the trials
From each article, we extracted information on the journal's impact factor, type of drug, phase of the trial, trial's sponsor (independent investigator, pharmaceutical industry, or government), trial's design, primary endpoints, sample size, and the trial's conclusion about the new drug.
In addition, we extracted specific information whether the authors mentioned any additional benefit of the new drug and whether the additional benefit was addressed in the trial. For example, if the author mentioned that the additional benefit of the new drug was its better safety profile, we evaluated whether any formal safety profile comparison was included in the results section of the article.
We classified the journals based on their impact factor listed in the Journal Citation Reports (JCR) 2008 edition. We arbitrarily chose a cut-off point of ten to classify a journal as high-impact or low-impact.
We extracted the phase of the trial according to the statement in the publication or the referenced clinical trial database, e.g. clinicaltrials.gov. The classification was Phase I, II, III, and IV. Phases II and III might be divided into two parts, A and B. The primary aims of phase IIA are assessment and exploration of efficacy and pharmacodynamic aspects of the drug in patients with the target disease. In phase IIB, the main objectives are to confirm efficacy in a relatively large group of patients and to determine the optimal dose and dosing regimen to be implemented in phase III trials. In phase III trials, the main objectives are to confirm and to gather the additional information about the effectiveness and safety of the drug that is needed to evaluate its overall benefit-risk profile. Phase IIIA is conducted prior to application for marketing authorization, while phase IIIB is conducted after application. [14,15]
We classified the type of primary endpoints as hard endpoints, intermediate endpoints and subjective endpoints. Hard endpoints were direct clinical events, such as mortality or stroke; intermediate endpoints were indirect outcome measurements that might not necessarily have a direct relationship with the clinical event, such as laboratory data or biomarkers; and subjective endpoints were endpoints based on the subjective perspective of the investigator or patient, such as quality-of-life questionnaires.
We extracted from the article specific characteristics of NI trials: degree of blinding, the method to determine the NI margin, the type of analysis, the use of a placebo arm to confirm assay sensitivity, and whether the authors discussed the constancy assumption. Furthermore, we extracted reasons for not including a placebo arm.
In terms of blinding, we extracted the literal term reported by the authors in the manuscript and classified the blinding as open-label, single, double, triple or 'ambiguously stated'. Since there are guidelines on the NI margin for anti-infective drugs, we assessed within these trials whether their NI margin was consistent with them. Based on guidelines of the FDA (1992) and CPMP (1997), the recommended NI margin for anti-infective drugs is a percentage difference of 10-20%.
We analyzed the quality of conducting NI trials by comparing the design and analysis characteristics of the trials reported in the high-impact vs. low-impact journals; and between the trials that were sponsored by industry and non-industry.

Quality of reporting
To evaluate the quality of reporting, we compared compliance with the requirements of the extension of the CONSORT statement for NI and equivalence trials [10] between articles published before and after 2006, in order to assess the impact of the extension on the reporting of NI trials. According to the extension of the CONSORT statement for NI trials, the methods section should include additional information on how similar the inclusion and exclusion criteria, type of interventions and outcomes were to those of previous efficacy trials of the active comparator. The additional information should also include the NI margin and the method to determine it, the sample size calculation, and whether a one-sided or two-sided confidence interval (CI) was used. The side of the CI is important in an NI trial, as the inference of non-inferiority is based on the CI of the treatment difference between the test drug and its comparator. NI is concluded when the CI excludes and lies beyond the NI margin. [11] Furthermore, we compared the compliance with the CONSORT statement extension's requirements between trials reported in high-impact and low-impact journals, and between trials that were sponsored by industry and non-industry.
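The CI-based decision rule can be made concrete with a small sketch. It is an illustration only and assumes a binary endpoint where a higher success proportion is better, a simple Wald interval for the difference in proportions, and hypothetical counts; the function name `ni_decision` and its parameters are not taken from the guidelines.

```python
from math import sqrt
from statistics import NormalDist

def ni_decision(x_test, n_test, x_ctrl, n_ctrl, margin, alpha=0.025):
    """Return (difference, lower confidence bound, NI concluded?).

    The difference is test minus control; non-inferiority is concluded when
    the one-sided lower bound lies above -margin.
    """
    p_t, p_c = x_test / n_test, x_ctrl / n_ctrl
    diff = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / n_test + p_c * (1 - p_c) / n_ctrl)
    z = NormalDist().inv_cdf(1 - alpha)  # quantile for a one-sided level alpha
    lower = diff - z * se
    return diff, lower, lower > -margin

# Hypothetical trial: 170/200 responders on the test drug, 172/200 on the comparator.
diff, lower, non_inferior = ni_decision(170, 200, 172, 200, margin=0.10)
print(round(diff, 3), round(lower, 3), non_inferior)
```

A one-sided bound at alpha = 0.025 coincides with the lower limit of a two-sided 95% CI, which is one reason why the side of the CI needs to be reported: a one-sided 95% bound would be less strict for the same data.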

Data analysis
Data were entered into a database using EpiData 3.1 (EpiData Association, Odense, Denmark; www.epidata.dk) and all statistical analyses were performed using SPSS 16 (SPSS Inc, USA; www.spss.com). The p-values for the differences were calculated using the Chi-square or Fisher's exact test.
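For readers who want to reproduce this kind of comparison, the two tests can be sketched for a 2x2 table using only the Python standard library. The analyses in this review were run in SPSS 16; the code below is a stand-in, the function names are hypothetical, and the example counts are invented for illustration rather than taken from the tables.

```python
from math import comb, erfc, sqrt

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(k):  # hypergeometric probability that the top-left cell equals k
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) and p-value, 1 df."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, erfc(sqrt(stat / 2))  # survival function of chi-square with 1 df

# Invented counts, e.g. items reported vs. not reported in two groups of articles.
print(fisher_exact_two_sided(3, 1, 1, 3))
print(chi_square_2x2(10, 20, 20, 10))
```

Fisher's exact test is usually preferred when expected cell counts are small, which is the common reason for reporting both tests in one analysis.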

Selection of the trials
The selection process of the NI trials is outlined in Figure 1. After filtering the articles based on the exclusion criteria, we included 227 articles in the analysis, which referred to 232 trials. One hundred eleven (47.8%) trials were published after 2006, the year in which the extension of the CONSORT statement on NI trials was published.
The missing data we retrieved from the registries were mostly data on the trials' phases and sponsorship. We only consulted the databases referred to by the authors, so we believe the data in the registers were reliable. We retrieved the phases of 34 trials from ClinicalTrials.gov; the phase of one trial and the sponsor of one trial from ISRCTN; the sponsor of one trial from the WHO international clinical trial registry; and the phase of one trial from a sponsor's clinical trial registry website.

The general characteristics of the trials
The general characteristics of the trials are described in Table 2. Most of the trials were published in low-impact journals (84.5%). Anti-infective drugs were the most studied drugs (22.9%).
Almost one-third (29.7%) of the studies were phase III studies and the majority had pharmaceutical industry involvement in their trial process (73.7%).
Almost all studies had a parallel design (93.1%), and both hard and intermediate endpoints were often investigated. Variability between studies in the ratio of the number of subjects in the analysis population to the planned number of subjects was considerable. Most of the trials concluded that the new drug was non-inferior to its comparator (209 trials, 90.1%). In 124 trials (53.4%), the authors mentioned additional advantages of the new drug. Most of the additional benefits mentioned and addressed were in terms of the safety profile of the drug, as shown in Table 4.

The quality of conducting NI trials
The design and analysis characteristics of the trials are described in Table 3, while stratification according to journal impact factors is shown in Table 5. Six journals did not have their impact factor listed in the JCR 2008 edition and were not included in the analysis. We found no significant difference in terms of trials' characteristics between trials that were sponsored by pharmaceutical industry or not (data not shown).
More than half of the trials were stated as double blinded, while a substantial number (79, 34.0%) was open label. We found no difference in terms of blinding method between trials that were published in high-impact or low-impact journals.
We observed that 227 (97.8%) trials reported their NI margin in the articles. Nevertheless, only 106 (45.7%) trials reported the method by which the NI margin was determined. In 51 (22%) trials, the margin was determined merely based on the investigator's assumption. In 20 (8.7%) trials, the NI margins were obtained from other publications or reviews. In 18 (7.7%) trials, the NI margins were obtained from guidelines, and in 17 (7.3%) trials the NI margins were calculated by the investigators based on data from previous trials. Of these 17 trials, 15 used a preserved fraction of 50% or greater. We also found that in 95 (40.9%) trials, the authors mentioned that the NI margin was a clinically acceptable margin. Among them, three trials mentioned that the decision to use the margin was validated by a panel of clinical experts. We found no difference in terms of the method to determine the NI margin between trials that were published in high-impact or low-impact journals.
Within 53 anti-infective trials, most of the trials (42, 77.8% of all anti-infective trials) used an NI margin of a percentage difference between 10% and 20%. Only four trials used an NI margin of less than 10% or more than 20%. Of the remaining seven trials, six did not use a percentage difference as the NI margin, and in one trial the NI margin was not clear.
In terms of statistical analysis, most of the trials (127, 54.7%) used either ITT or PP, while 97 (41.8%) trials used both ITT and PP analysis. Among the trials that used both ITT and PP analysis, 94 concluded that the new drug was non-inferior to its comparator. In 53 of these trials, the conclusions were drawn from similar results of both ITT and PP analysis. In the remaining trials, 22 concluded non-inferiority based on the results of their PP analysis; 18 based their conclusion only on the results of their ITT analysis; and in three trials it was not clear on which analysis the conclusion was based. We found a significant difference in terms of type of statistical analysis between the trials published in high-impact and low-impact journals. Trials published in high-impact journals mostly used only ITT analysis (54.3% of 46 trials), while in low-impact journals both analysis methods were most frequently used (44.4% of 180 trials).
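The distinction between concluding non-inferiority from both analysis sets and from only one of them can be shown with a minimal sketch. It assumes that each analysis set yields a lower confidence bound for the difference (test minus control) and that non-inferiority requires every bound to lie above the negative margin; the function name and the numbers are hypothetical.

```python
def conclude_non_inferiority(lower_bounds, margin):
    """Check the NI criterion per analysis set and across all sets.

    lower_bounds: dict mapping an analysis set name (e.g. "ITT", "PP") to the
                  lower confidence bound of the difference, test minus control.
    Returns (NI in all sets?, per-set results).
    """
    per_set = {name: bound > -margin for name, bound in lower_bounds.items()}
    return all(per_set.values()), per_set

# Hypothetical discordant trial: the ITT bound clears a 10-point margin, the PP bound does not.
overall, per_set = conclude_non_inferiority({"ITT": -0.04, "PP": -0.11}, margin=0.10)
print(overall, per_set)
```

Under this stricter reading, a trial with discordant ITT and PP results would not claim non-inferiority, whereas a trial reporting only one analysis set gives the reader no way to detect such discordance.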
In our review, we observed that 210 trials (90.5%) did not include a placebo arm to confirm assay sensitivity. Only 19 trials mentioned the reason why a placebo arm was not included, and in almost half of them the reason was ethical. We observed that the inclusion of a placebo arm was quite common (28.6%) in trials with neurology/psychiatric drugs. This is probably because, for this type of drug, the constancy assumption will often not hold, as the placebo effect in previous placebo-controlled trials is difficult to rule out. In addition, we found no difference in terms of using a placebo arm to confirm assay sensitivity between trials that were published in high-impact or low-impact journals. Furthermore, we observed that only nine (3.9%) authors discussed the constancy assumption, and there was no difference in this respect between trials that were published in high-impact or low-impact journals.

Note: *Percentage is based on 124 trials that mentioned any additional benefit of the new drugs, irrespective of whether or not data were shown to support the claim. "The authors show any analysis or argument of the additional benefit.

Compliance in reporting NI trials
Only 3.0% of the trials reported the similarity of the inclusion and exclusion criteria with previous trials studying the effect of the active comparator, 5.6% of the trials reported the similarity of the type of intervention with previous trials, and 3.4% of the trials reported the similarity of the outcomes. Seventy-seven (33.2%) trials did not report in the methods section whether they were going to present the data using a one-sided or two-sided CI, as required by the CONSORT statement. Furthermore, we found that the papers in low-impact journals reported the side of the CI more frequently than those in high-impact journals, and the difference was significant (Table 6).
The compliance in reporting the items required by the extension of the CONSORT statement before and after 2006 is described in Table 6. We did not observe improvement of reporting after the release of the CONSORT statement extension for NI trials. The method of determination of the NI margin was even reported less frequently in trials published after 2006 than in trials published before and in 2006.

Discussion
In this review, we found five main issues in the design, analysis and reporting of NI trials. First, many of the trials were open-label trials. Second, reporting of the method to determine the NI margin was infrequent and limited. Third, most of the trials analyzed their data with only one statistical analysis method: ITT or PP. Fourth, we observed that only a few trials included a placebo arm to confirm assay sensitivity and that only a few trials discussed the constancy assumption. Lastly, we did not observe any difference in terms of reporting between NI trials published before and after the release of the extension of the CONSORT statement for NI trials in 2006.
In our review, about a third of the trials were open-label trials. This surprising finding is not consistent with the guidelines [8,11], which recommend blinding whenever possible to minimize the risk of bias. This leads to a discussion on the importance of blinding in an NI trial. Snapinn believes that blinding gives only minor protection in NI trials, since a blinded investigator with a preliminary belief in the non-inferiority of the test drug can bias the result by assigning similar ratings to the treatment responses of all patients. [16] There is no doubt, however, that blinding does offer protection against information bias. In addition, there will usually be endpoints (e.g. safety) for which differences are expected and for which blinding will ensure stronger evidence. We therefore conclude that blinding is still important in NI trials to avoid bias. If blinding is not possible, subjective endpoints need to be avoided and more stringent monitoring should be conducted.
The method to determine the NI margin was not reported in more than half of the trials. This finding is consistent with previous reviews in 2005 to 2006, where the methods were presented in 46% or less of the trials. [17,18,19] Apparently, the extension of the CONSORT statement in 2006 has not brought any significant impact yet. Furthermore, the statement has suggested that the NI margin should be preferably justified on clinical grounds and its relation to the effect of the reference treatment relative to placebo in any previous trials should be noted. [10] We found that most of the authors included a statement that the NI margin was a clinically acceptable difference, but only three trials mentioned that the margin was validated by a panel of clinical experts. This finding was consistent with other reviews [17,18,19], where many trials claimed that their margin was clinically relevant without any clear details how the clinically acceptable NI margin was chosen. Putting merely a statement that the margin was determined based on clinically acceptable difference is not sufficient for any subsequent trial replications. Thus, more details are needed in the description on how the NI margin was determined. Furthermore, a detailed description on how the margin was determined can help the reader to decide whether the NI margin and the rationale for the margin's choice influenced the validity of the results.
We observed in anti-infective drug trials, that most of them used a constant difference of 10-20% in treatment difference as their NI margin. Regulators recommend an NI margin of 10% for vaccines and anti-bacterials. [20,21] This margin of 10% is acceptable as long as the primary outcome of interest has a high incidence rate. The implication of using a 10% constant margin in vaccines and anti-infective drugs should be further explored and any improvement on the guidelines to determine NI margin should cover this issue.
We observed that most of the trials reported the result only from ITT analysis or PP analysis. Our results were consistent with a previous review that observed that more NI trials used ITT rather than PP. [18] We also observed that ITT analysis was more reported in high-impact journals. The CPMP guidelines and the new draft FDA guidelines for NI trials already stated that both analyses have equal importance in NI trials. For superiority trials, ITT analysis is the preferred analysis as it adheres to randomization [11] and might best reflect clinical practice. PP analysis might violate randomization and not reflect clinical practice very well. Several reviews with RCT simulation showed that both ITT and PP could be problematic in NI trials, especially if the trial had large number of non-compliance. [22,23,24] In addition, in our data, we did not observe any evidence that ITT will lead to more NI conclusions than PP. We conclude that both analyses are equally important, as each approach brings a different interpretation for the drug in daily practice.
We observed that only a small number of trials included placebo arms to support assay sensitivity. Although our data did not provide sufficient evidence on whether the use of placebo would have been appropriate in the trials, we believe that the use of a placebo arm was probably not ethically feasible in most studies. Nonetheless, a non-inferiority result in an NI trial might bear two meanings: both drugs are equally effective, or both drugs are equally ineffective against placebo. In this sense, a placebo arm in an NI trial enables evaluation of whether both drugs in the trial are effective if the trial shows non-inferiority. Alternatively, if the use of a placebo arm is not possible, the trial should choose a margin that assures that the estimated effect of the new drug is likely to be superior to placebo, under the constancy assumption for the active comparator. The readers, and not only the investigators, also need to be aware of this issue of assay sensitivity when interpreting the results of NI trials. They need to consider the type of endpoints; the number of patients in the final analysis; the reasons for patient dropouts; the similarity of the trial with the previous trial(s) that established the efficacy profile of the comparator; and the constancy assumption of the data used as reference for the NI margin. Based on our review, the latter two were reported in only a small number of the articles.
Less than five percent of the trials in our review mentioned whether the trials were designed similarly to relevant past trial(s). Thus, it was difficult to assess whether the historical data that were used for determining the NI margin were reliable. Since the validity of the NI margin is related to the interpretation of the NI trials, clear reporting of the method of NI margin determination and the constancy assumption is essential for every NI trial publication. It is impossible to check the validity of the constancy assumption without a parallel placebo arm. However, at minimum, it is possible to check whether the current NI trial was similar to previous trial(s) that estimated the efficacy of the active comparator. [25]
We found no difference between reporting before and after the release of the extension of the CONSORT statement on NI trials. Furthermore, in general, there was no difference in adherence to the CONSORT statement between high-impact and low-impact journals. The overall low adherence to the statement might be due to unfamiliarity of authors, referees, and editors with the statement extension. Researchers and journal editors should be more aware of this extension and should comply with its recommendations. We realize that it might be too early to see full adherence to the CONSORT statement extension after 3 years, but given the reputation of the CONSORT statement itself, we considered it reasonable to expect a certain degree of improvement.
Our review has some limitations. First, we excluded several trials since we only used a random sample of all NI trials that we identified. However, as this was a random sample, this will not have influenced our results. Second, we only used PubMed to identify NI trials; therefore, we might have missed some trials. However, we assume that NI trials retrieved from PubMed do not have different methodological characteristics than NI trials in other databases, so we do not think that this influenced our results. Third, since the terms that we used to search for non-inferiority trials were not standard MeSH terms and our search for those terms was limited to the abstracts of the articles, our search might not have captured all NI drug trials available in PubMed. For this selection too, we expect that the NI trials that we found are not different from the NI trials that we did not capture with our search. A strength of our study is that we did not focus only on the NI margin, as previous reviews [17,18,19] did, but also evaluated other methodological aspects of NI trials. In addition, we evaluated the quality of reporting using the current guidelines from the CONSORT statement.
In conclusion, the conduct and reporting of NI trials can be further improved, particularly in terms of maximizing the use of blinding, using both ITT and PP analysis, reporting the similarity with the previous comparator's trials to support a valid constancy assumption, and reporting the method to determine the NI margin.