Deficient Reporting and Interpretation of Non-Inferiority Randomized Clinical Trials in HIV Patients: A Systematic Review

Objectives: Non-inferiority (NI) randomized clinical trials (RCTs) commonly evaluate efficacy of new antiretroviral (ARV) drugs in human immunodeficiency virus (HIV) patients. Their reporting and interpretation have not been systematically evaluated. We evaluated the reporting of NI RCTs in HIV patients according to the CONSORT statement and assessed the degree of misinterpretation of RCTs when NI was inconclusive or not established.
Design: Systematic review.
Methods: PubMed, Web of Science, and Scopus were searched until December 2011. Selection and extraction were performed independently by three reviewers.
Results: Of the 42 RCTs (n = 21,919; range 41–3,316) selected, 23 were in ARV-naïve and 19 in ARV-experienced patients. Twenty-seven (64%) RCTs provided information about prior RCTs of the active comparator, and 37 (88%) used 2-sided CIs. Two thirds of trials used a NI margin between 10 and 12%, although only 12 explained the method used to determine it. Blinding was used in only 9 studies. The main conclusion was based on both intention-to-treat (ITT) and per protocol (PP) analyses in 5 trials, on PP analysis only in 4 studies, and on ITT only in 31 studies. Eleven of 16 studies with NI inconclusive or not established highlighted NI or equivalence, and distracted readers with positive secondary results.
Conclusions: There is poor reporting and interpretation of NI RCTs performed in HIV patients. Maximizing the reporting of the method of NI margin determination, use of blinding and both ITT and PP analyses, and interpreting negative NI according to actual primary findings will improve the understanding of results and their translation into clinical practice.


Introduction
Non-inferiority (NI) randomized controlled trials (RCTs) are a standard research methodology to demonstrate that a new experimental treatment is not worse than reference therapy (active comparator) in terms of efficacy. Human immunodeficiency virus (HIV) NI trials have emerged as the new standard design for HIV drug development in both antiretroviral (ARV)-naïve and -experienced patients [1]. Although increased efficacy rates of highly active antiretroviral therapies (HAART) have reduced space for newer antiretroviral agents with better efficacies [2], there is a need for treatment simplification and newer alternative agents. This has led to a growing number of HIV NI trials in recent years.
The extended Consolidated Standards of Reporting Trials (CONSORT) statement of 2006 [3] has updated recommendations to guide the conduct and reporting of NI trials. Reports indicate that endorsement of the CONSORT statement by journals is associated with significant improvement in the quality of reporting of RCTs [4]. However, there have been emerging concerns regarding deficiencies in adherence to guidelines and recommendations in the design, statistical analysis and reporting of RCTs investigating NI [5,6]. An important aspect in designing a NI trial is the need to provide a rigorous scientific justification for the choice of NI margin. There is a considerable risk of accepting less than effective experimental therapies from NI trials with non-rigorous margins. Other basic requirements of a well-designed NI trial are a sample size calculation that takes the NI margin into account, a clear description of the use of 1- or 2-sided confidence intervals, and both per protocol (PP) and intention-to-treat (ITT) analyses [7].
Another area of concern is the reporting and interpretation of NI RCTs in which NI was not established or was inconclusive. Investigators' agendas, such as personal, financial, and intellectual conflicts of interest, can influence how research findings are presented. Authors can shape the way readers interpret their results in a variety of ways. Distorted presentation or interpretation of non-significant trials, whether conscious or unconscious, is known as "spin" [8].
Against this background, we performed a systematic review of the literature to identify NI RCTs involving antiretroviral therapies in ARV-naïve and -experienced patients and evaluated the methodological quality and reporting standards by applying the extended CONSORT statement for those trials. We also aimed to identify the strategies, extent and level of spin in trials in which NI was inconclusive or was not established.
The following keywords were used: noninferiority; clinical trial; trial; antiretroviral, highly active antiretroviral therapy; and HAART. The search strategy of PubMed is available in the Supporting Information Text S1.

Study Selection
We searched for NI RCTs published in any language. RCTs were defined as prospective trials evaluating healthcare interventions in participants randomly assigned to study groups. Non-inferiority trials were defined as RCTs that aim to demonstrate that the new intervention is not worse than the comparator by more than a specified small amount, the NI margin (delta margin).
All published studies that assessed the efficacy of new ARV drug combinations or new interventions in comparison to a standard therapy or intervention in both ARV-naïve and ARV-experienced HIV patients were included. We excluded articles that were not RCTs or were reviews and/or comments. Equivalence trials were excluded. Equivalence trials are trials that aim to demonstrate that the study and control treatment effects differ by no more than a specific amount, the equivalence margin.
A list of retrieved articles was reviewed independently by 3 investigators (AVH, VP, AD) in order to choose potentially relevant articles, and disagreements about particular studies were discussed and resolved.

Data Extraction
Data extraction from selected studies was performed independently by 3 investigators (AVH, VP, AD). Disagreement was resolved by consensus. Using a standardized data extraction form, we collected information on lead author, study name, year of study or publication year, study sponsor, study location, duration of study, study design, study sample size, new drug arm, standard drug arm, primary outcomes and secondary outcomes.

Evaluation of Methodological Quality
The following data were extracted from all selected studies: 1) Choice of NI margin, 2) Method of selection of NI margin, 3) Whether the sample size calculation used the NI margin, 4) 1- or 2-sided confidence intervals, 5) Blinding method, 6) Statistical analysis (PP, ITT or both), and 7) Main conclusion based on PP, ITT, or both. The establishment of NI was based on confidence intervals reported by investigators: when efficacy was measured by success rates, the lower CI limit should be above the negative NI margin; when efficacy was measured by failure rates, the upper CI limit should be below the positive NI margin. Other conclusions, such as failure to establish NI or inconclusive results, followed the explanations of the CONSORT guidelines [3].
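The confidence-interval rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ni_established`, its parameters, and the example numbers are hypothetical and are not taken from any included trial. The CI is assumed to be for the difference (new arm minus standard arm), and the sketch covers only the success-rate and failure-rate rules stated here, not the full set of CONSORT interpretations.

```python
def ni_established(ci_lower, ci_upper, margin, outcome="success"):
    """Apply the CI rule for non-inferiority to a reported interval.

    ci_lower, ci_upper: reported CI limits for (new arm - standard arm).
    margin: pre-specified NI margin as a positive number (e.g. 0.12 for 12%).
    outcome: "success" if efficacy is a success rate, "failure" if a failure rate.
    All names here are illustrative assumptions, not the review's own tooling.
    """
    if margin <= 0:
        raise ValueError("margin must be a positive number")
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if outcome == "success":
        # Success rates: the lower limit must lie above the negative margin.
        return ci_lower > -margin
    if outcome == "failure":
        # Failure rates: the upper limit must lie below the positive margin.
        return ci_upper < margin
    raise ValueError("outcome must be 'success' or 'failure'")


# Hypothetical intervals with a 12% margin.
print(ni_established(-0.08, 0.04, 0.12))             # success rates, lower limit above -12%
print(ni_established(-0.15, 0.02, 0.12))             # success rates, lower limit below -12%
print(ni_established(-0.03, 0.10, 0.12, "failure"))  # failure rates, upper limit below +12%
```

The margin is passed as a positive number and the sign is handled inside the function, which mirrors the "negative margin for success rates, positive margin for failure rates" wording of the rule.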

Definitions and Evaluation of Spin
Spin was defined in the context of NI trials in which NI was not established or was inconclusive. Evidence of spin was ascertained when one or both of the following were present: 1) Highlighting NI when NI is not established, inconclusive or unclear. 2) Distracting the reader with other results (e.g. secondary outcomes or information from other studies) when NI is not established, inconclusive or unclear.
The strategy of spin employed by the authors was determined. The strategies of spin considered were: 1) Focus on statistically significant results (within-group comparisons, secondary outcomes, subgroup analyses, modified population of analyses); 2) Interpreting the negative result of the primary outcome (i.e. NI not established or inconclusive) as showing equivalence; and 3) Claiming or emphasizing NI when NI was not established or was inconclusive.
The extent of spin was assessed in the abstract: results section only, conclusions section only, or both. Extent of spin was also assessed in the main text: one section other than conclusions section (results section, or synthesis of the results in the discussion section), in the conclusions section only, in 2 sections, or in all 3 sections.
Lastly, the level of spin in conclusions of the abstract and conclusions of the main text was evaluated. High spin involved no uncertainty in the framing, no recommendations for further trials, and no acknowledgement of not establishing NI for the primary outcome; also, when the new treatment is recommended for use in clinical practice. Moderate spin involved some uncertainty in the framing or recommendations for further trials, but no acknowledgement of not establishing NI for the primary outcome. Low spin involved uncertainty in the framing and recommendations for further trials or acknowledgement of not establishing NI.
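The three levels of spin can be written as a small decision rule. The sketch below is a hypothetical encoding, not the instrument used in this review: the function name `spin_level` and its boolean inputs are assumptions. Where the definitions overlap (for example, a conclusion with both uncertainty and a call for further trials but no acknowledgement), the order of checks is also an assumption: a recommendation for clinical use is treated as high spin first, then the low criteria are checked, then the moderate criteria.

```python
def spin_level(uncertainty, further_trials, acknowledges_ni_failure,
               recommends_use=False):
    """Classify the level of spin in a conclusions section.

    uncertainty: the framing expresses some uncertainty.
    further_trials: further trials are recommended.
    acknowledges_ni_failure: the text acknowledges that NI was not
        established for the primary outcome.
    recommends_use: the new treatment is recommended for clinical practice.
    The precedence of the checks below is an assumption of this sketch.
    """
    if recommends_use:
        return "high"
    if acknowledges_ni_failure or (uncertainty and further_trials):
        return "low"
    if uncertainty or further_trials:
        return "moderate"
    return "high"


# Hypothetical conclusions sections.
print(spin_level(False, False, False))  # no hedging, no acknowledgement
print(spin_level(True, False, False))   # some uncertainty only
print(spin_level(True, True, False))    # uncertainty and further trials
```

In practice the inputs would come from reviewers' independent judgments of each conclusions section, so the function only makes the combination rule explicit.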

Statistical Analysis
Primarily, we stratified the studies on the basis of history of ARV therapy: ARV-naïve studies vs ARV-experienced studies. The baseline risk of patients is different between the two groups: ARV-experienced patients are more likely to develop virological failure and resistance to ARV drugs compared to ARV-naïve HIV patients. Secondarily, we stratified studies by a) year published: before 2007 vs since 2007, because studies have shown improvement in reporting standards in RCTs post-CONSORT [9]; and b) type of sponsor: government vs pharmaceutical companies, because studies have shown that industry-sponsored trials tend to draw pro-industry conclusions (sponsorship bias) [10]. We did not attempt to formally use p values for comparisons between ARV-naïve and ARV-experienced studies, as we anticipated a small sample in each group.
... the report, or in the decision to submit the article for publication. The researchers are all independent from the funding source.

Assessment of Methodological Quality
All 42 RCTs assessed were of parallel design. Four trials [11,14,34,36] were reported as equivalence trials when in fact they were NI trials. The majority of the studies (38/42) explained why a NI trial was performed as opposed to a superiority or equivalence trial. Thirteen trials were phase 3, one was phase 2, one was phase 4, and 27 studies did not clearly state the trial phase. Twenty-seven trials (64%) reported the similarity of the standard arm (comparative arm) to previous efficacy trials with respect to inclusion/exclusion criteria, types of drugs and outcomes. None of the studies were placebo-controlled.
The study design characteristics stratified by ARV-naïve and ARV-experienced trials are summarized in Table 3. All studies identified a pre-specified NI margin. All but two of the studies described a NI margin between 7% and 25%. In one ARV-naïve study [24] and one ARV-experienced study [48], the upper 95% CI limit of the hazard ratio served as the NI margin. The recommended NI margin between 10 and 12% [53,54] was used in 17/23 (74%) ARV-naïve and 11/19 (58%) ARV-experienced studies. Only 9 (39%) studies in the ARV-naïve group and 3 (16%) studies in the ARV-experienced group reported justification for their choice of NI margin or limits of hazard ratios. In 2 studies, the NI margin was selected based on investigators' assumptions [12,24]; in 4 studies, based on other publications or reviews [11,14,35,46]; in 2 studies, based on guidelines [28,32]; in 2 studies, calculated from previous trial results [20,31]; in one study, based on investigators' assumptions and other publications and reviews [15]; and in one study, based on guidelines and calculated from previous trial results [37]. Sample size calculation used the NI margin in 8 (35%) and 13 (68%) studies in ARV-naïve and ARV-experienced trials, respectively. A double-blind design was employed in 8 ARV-naïve trials and one ARV-experienced trial. All trials reported results using the confidence intervals approach; 2-sided confidence intervals were used in 22 (96%) and 15 (79%) studies in ARV-naïve and ARV-experienced trials, respectively. Although both ITT and PP analyses were performed in 10 (44%) and 9 (47%) trials in ARV-naïve and ARV-experienced trials respectively, only in 5 studies in the ARV-experienced group was the main conclusion based on both analyses (Table 4). Two studies each in ARV-naïve and ARV-experienced trials gave their main conclusion based on PP analyses, the main analysis for NI trials (Table 4). The type of statistical analysis performed was not clear in 2 of the trials in the ARV-naïve group.
Only use of NI margin to calculate sample size and blinding method used were significantly different between the two groups of studies (Table 3).
In ARV-naïve trials, NI was established in 13 (57%) studies (Table 4). Of these, one study [13] did not comment on the establishment of NI, although NI was established. NI was not established in 7 studies. Of these, one study [11] concludes "equivalence not established" when NI is not established. In one study [14], the authors do not acknowledge that NI is not established and conclude "equivalence established" when in fact the result is superior. Two studies [25,26] conclude NI established when the results are actually superior. One study [18] concludes exclusion of inferiority when the results are superior. Two trials, though inconclusive, are not reported as such: one trial [33] is reported as inferior although it has discordant results between ITT and PP analyses, with superiority established by PP analysis, and one trial [24] mentions "the upper (but not lower) CI was higher than the pre-defined margin of NI". One study [21] was terminated early because of slow recruitment and a high rate of early virological failures.
In ARV-experienced trials, NI was established in 12 (63%) studies (Table 4). Of these, one study [36] concluded "equivalence established" when in fact NI was established. NI was not established in 5 studies. Of these, one study [37] concludes NI established when the result is actually superior. One study [41] was inferior and was rightly acknowledged as such by the authors. One trial [44] was inconclusive, with the authors appropriately mentioning that they cannot conclude NI because conclusions were discordant (NI established by PP analysis but not by ITT analysis) with respect to the NI margin.
Additional benefit with the new drug arm was claimed in 22 (96%) of the ARV-naïve trials (10 studies with NI not established/inconclusive/inferior), and in 11 (58%) of the ARV-experienced trials (6 studies with NI not established or inconclusive). The additional benefits most commonly claimed were fewer adverse events, improved lipid profile and low rates of virological failure. All studies claiming additional benefit clearly explained the benefits. All but two studies [21,40] claiming additional benefit had analysis performed to support their claims. Study design characteristics stratified by year of publication and type of sponsor are summarized in Table S1 and Table S2, respectively. None of the study design characteristics were different between the two groups when stratified by year published; only use of NI margin to calculate sample size and blinding method used had different distributions between the two groups of studies when stratified by type of sponsor.
Of the 15 studies in which spin was identified, the most common (8/15) strategy of spin employed was focusing on and highlighting statistically significant results, which included within-group comparisons, secondary outcomes, subgroup analyses, and/or modified population of analyses. In total, 10 abstracts were classified as having spin, of which 6 had spin in both results and conclusions sections and 4 in the conclusions section only. The level of spin in the conclusions section of the abstract was 'high' in 6 studies. In total, 11 articles were classified as having spin in their main text. More than 50% of the articles (8/15) had spin in at least two sections of the main text, while 3 studies had spin in one section of the main text. The level of spin in the conclusions section of the main text was 'high' in 4 studies.
Strategies, extent and level of spin in studies stratified by year published and type of sponsor are summarized in Table S3 and Table S4, respectively. No differences were observed.

Principal Findings
We investigated the methodological quality and reporting standards of RCTs of HIV NI trials. The overall quality of HIV NI RCTs was poor. The main deficiencies were lack of reference to historical data on the active comparator, no information on the method of selection of the NI margin, not taking the NI margin into account while determining sample size, inadequate blinding of patients, and failure to perform both ITT and PP analyses. Other flaws encountered less frequently were using the terms equivalence and NI interchangeably and not clearly stating when the trial results were inconclusive or superior. We also identified a high frequency of spin in NI trials in which NI was not established or was inconclusive. The most common strategy of spin observed in these trials was the focus on statistically significant results for other analyses.

Methodological and Reporting Standards in HIV Literature
The findings of our study are consistent with previous studies assessing methodological quality and reporting in HIV NI trials. Parienti et al investigated methodological standards of NI HIV trials reported in pre-specified select journals of high impact factor between 2001 and 2006 [2]. Four out of 18 studies provided a rationale for the NI margin and 7/18 studies performed only ITT analysis for their primary endpoint. In studies with both ITT and PP analyses, the main conclusion was based on ITT analysis, with the exception of one study. In a review of company-sponsored phase 3 NI trials between 2000 and 2007, Hill et al discussed the implications of study design for the choice of endpoints and sample size calculations [54]. They report inconsistencies in the design and interpretation of HIV NI trials and stress the importance of adopting standardized guidelines in conducting NI trials. In a recent study, the statistical methods of 11 HIV NI trials published in 2010 were analyzed [1]. The authors noted that the conclusions of these trials were heavily dependent on the statistical methods used to estimate confidence intervals. Both the two-sided 95% CI and the one-sided 97.5% CI can be used for assessment of NI. The key is not to reach a wrong conclusion by using the wrong CI or alpha level. It is also possible for the two CIs to lead to different conclusions, but this is an uncommon situation [55].
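The role of the CI level can be illustrated with a short Python sketch. It assumes a simple Wald (normal approximation) interval for the difference in success rates, which is only one of several methods a trial might use; the function names, the counts and the 11% margin are hypothetical and are not taken from any included trial.

```python
from math import sqrt
from statistics import NormalDist


def wald_se(success_new, n_new, success_std, n_std):
    """Wald standard error of the risk difference (new minus standard)."""
    p_new = success_new / n_new
    p_std = success_std / n_std
    return sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)


def lower_bound(diff, se, level, sides):
    """Lower confidence bound for a risk difference.

    level: confidence level, e.g. 0.95 or 0.975.
    sides: 2 for a two-sided interval, 1 for a one-sided bound.
    """
    if sides == 2:
        tail = (1 - level) / 2
    elif sides == 1:
        tail = 1 - level
    else:
        raise ValueError("sides must be 1 or 2")
    z = NormalDist().inv_cdf(1 - tail)
    return diff - z * se


# Hypothetical counts: 160/200 responders on the new arm, 168/200 on the standard arm.
diff = 160 / 200 - 168 / 200
se = wald_se(160, 200, 168, 200)
margin = 0.11  # hypothetical NI margin

for level, sides in [(0.95, 2), (0.975, 1), (0.95, 1)]:
    lb = lower_bound(diff, se, level, sides)
    print(f"{sides}-sided {level:.1%} lower bound: {lb:+.4f}, NI met: {lb > -margin}")
```

With these hypothetical numbers, the two-sided 95% and one-sided 97.5% lower bounds coincide and fall below the negative margin, whereas a one-sided 95% bound lies above it. This is the kind of discrepancy that arises when the CI or alpha level does not match the one pre-specified for the NI assessment.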
Statistical decision procedures based on confidence limits are not the only valid and efficient inferential methods for establishing NI. Kaul et al [55] also refer to the use of the hypothesis-testing framework. Here, the null hypothesis of inequality (risk difference is greater than or equal to the margin) is rejected in favor of the alternative hypothesis of equality (risk difference is less than the margin) if the 1-sided P value is less than 0.025. These authors concluded that the judgment of NI is based on 3 prerequisites: 1) the new treatment exhibits therapeutic NI to the standard treatment; 2) the new treatment would exhibit therapeutic efficacy in a placebo-controlled trial, if such a trial were performed; and 3) the new treatment offers ancillary benefits with respect to safety, tolerability, convenience, or cost. The establishment of therapeutic NI is based on the a priori definition of NI margin, the adequate power of the trial, the consistency of the active control effect with that in historic trials, the similarity of design and conduct with historic trials, and the stability of the NI with alternative analytical ...
Table 3. Study design characteristics stratified by type of trial population.

Spin in RCTs
Assessing RCTs for flaws in reporting and interpretation in terms of strategies, extent and level of 'spin' is a relatively new concept. Boutron et al identified the nature and frequency of spin in superiority RCTs with statistically non-significant results for primary outcomes [8]. All RCTs published in the month of December 2006 were analyzed and 72/205 RCTs were found to have statistically non-significant results. The strategies of spin were diverse; 68% of the abstracts and 61% of the main texts were found to have spin in at least 1 section, with a high level of spin in 33% of abstract conclusions sections and 26% of main-text conclusions. We adopted the definitions and classification scheme of spin from this study and applied them in the context of HIV NI trials where NI was not demonstrated or was inconclusive.

What our Results Add to Existing Literature
Double blinding was used in only 9 trials, although guidelines suggest using blinding whenever possible to minimize the risk of bias, especially information bias [6]. Although all included NI HIV trials pre-specified the NI margin, most did not explain the reasoning behind the selection of a given NI margin, and most ARV-naïve trials did not use the NI margin for sample size calculations. In most of the studies, the reasoning provided for selection of the NI margin was not scientifically well grounded and was based on investigators' assumptions or on other publications or reviews. We could not determine whether this was due to space limitations or due to a real lack of definition in the trial protocol. It is essential that the clinical and statistical reasoning behind the selection of the NI margin be appropriately described in the manuscript [2,5,6]. Two thirds of the included trials reported the similarity of the current standard arm to previous trials (where the efficacy of the standard arm was established) with respect to outcomes, drug doses and inclusion/exclusion criteria. Any differences in these items should be described and justified [3]. Also, most trials based their study conclusion on ITT analysis only. NI trials favor the PP analysis, which excludes patients with major protocol violations; because including these patients is expected to make the groups more similar, analysis of the PP population is thought to be more likely to show differences between treatments. However, both ITT and PP analyses are required to demonstrate NI [6,51], and this was only true in 5 of our selected trials. Also, reporting of most methodological characteristics was found to be lacking irrespective of the history of ARV therapy, year published and type of sponsor.
We have developed a methodology to evaluate the presence, extent and degree of spin in NI trials, following the recommendations developed for non-significant superiority trials [8]. Among the trials where NI was not demonstrated or was inconclusive, it was quite common to divert the attention of readers to significant secondary analyses, and also to conclude 'equivalence' or even to stress the finding of NI where there was none. We also found that studies with spin showed it in several sections of the manuscript and abstract, and usually of moderate or high degree. We expect researchers to use this methodology to avoid or at least minimize spin in the reports of their NI trials, and that this will improve the correct and balanced interpretation of their findings.
Our study highlights that reporting of the methodology of NI HIV trials is still deficient in comparison to previous evaluations [2,51]. Although some time is necessary to adopt recommendations from guidelines, we strongly suggest following the checklist of the CONSORT statement on NI trials [3]. Several journals have adopted the CONSORT guidelines and its extensions, but some major infectious diseases journals are not among the endorsers [52]. If investigators do not appropriately report basic information about the methodology and interpretation, physicians and policy makers may be misled by the conclusions of these trials. However, there is no formal publication evaluating the effects of following the CONSORT guidelines on the reporting of NI trials. This has been done for superiority trials, where following guidelines improved the quality of reporting [4].
Table 5. Spin in trials where non-inferiority was not established or was inconclusive by type of trial population.

Limitations
There are some limitations to our study. Some specific methodological items might have been conducted but not reported by the authors in the reports we assessed. We did not contact individual trial investigators for any missing items in their reports or trial protocols; instead, we relied solely upon what was reported for specific items. There is a degree of subjectivity involved in the assessment of spin. However, we pre-specified the evaluation of the presence, extent and degree of spin. We tried to limit this investigator-driven bias by performing data extraction with a standardized data extraction sheet. This was done independently by three reviewers and disagreements were resolved by consensus.

Summary and Recommendations
We described the most comprehensive systematic review to date of NI RCTs in HIV literature. Our findings demonstrate the prevalence of deficiencies in design, reporting and interpretation of NI RCTs in ARV-naïve and ARV-experienced HIV patients.
There is a clear need for improving standards of methodology and reporting by following established guidelines when designing and evaluating RCTs. Reviewers of journals, as well as readers should be more aware of these shortcomings in reports of NI RCTs in HIV patients. Rigorous implementation of higher standards in trial design and fully transparent reporting of results will not only improve reliability of the studies but also lead to appropriate appraisal, interpretation and application of results to patient care.