Comparison of HIV-1 Genotypic Resistance Test Interpretation Systems in Predicting Virological Outcomes Over Time

Background Several decision support systems have been developed to interpret HIV-1 drug resistance genotyping results. This study compares the ability of the most commonly used systems (ANRS, Rega, and Stanford's HIVdb) to predict virological outcome at 12, 24, and 48 weeks.
Methodology/Principal Findings Included were 3763 treatment-change episodes (TCEs) for which an HIV-1 genotype was available at the time of changing treatment, with at least one follow-up viral load measurement. Genotypic susceptibility scores for the active regimens were calculated using scores defined by each interpretation system. Using logistic regression, we determined the association between the genotypic susceptibility score and the proportion of TCEs having an undetectable viral load (<50 copies/ml) at 12 (8–16) weeks (2152 TCEs), 24 (16–32) weeks (2570 TCEs), and 48 (44–52) weeks (1083 TCEs). The area under the ROC curve was calculated using 10-fold cross-validation to compare the interpretation systems regarding their sensitivity and specificity for predicting an undetectable viral load. The mean genotypic susceptibility score was slightly lower for HIVdb, with 1.92±1.17, compared to Rega and ANRS, with 2.22±1.09 and 2.23±1.05, respectively. However, similar odds ratios were found for the association between each unit increase in genotypic susceptibility score and undetectable viral load at week 12: 1.6 [95% confidence interval 1.5–1.7] for HIVdb, 1.7 [1.5–1.8] for ANRS, and 1.7 [1.6–1.9] for Rega. Odds ratios increased over time, but remained comparable (odds ratios ranging between 1.9–2.1 at 24 weeks and 1.9–2.2 at 48 weeks). The area under the ROC curve did not differ between the systems at any time point: p = 0.60 at week 12, p = 0.71 at week 24, and p = 0.97 at week 48.
Conclusions/Significance Three commonly used HIV drug resistance interpretation systems (ANRS, Rega, and HIVdb) predict virological response at 12, 24, and 48 weeks after change of treatment to the same extent.


Introduction
The effectiveness of antiretroviral therapy has been limited by the development of HIV-1 drug resistance. Resistance occurs frequently in patients and may decrease both the magnitude and the duration of the response to treatment [1].
Several prospective studies have shown that the use of genotypic resistance analysis to guide the new treatment choice for patients failing their current HAART improves virologic outcome [2,3,4,5]. The complex mutational patterns are however difficult to interpret, due to the many different drug resistance mutations [6] and the varying levels of decreased susceptibility of these mutations to different drugs. This led to the development of several interpretation systems [7], which provide rules to help physicians interpret HIV-1 drug resistance genotyping results.
ANRS, Stanford HIVdb, and Rega are the three most commonly used publicly available drug resistance interpretation systems, and all three are regularly updated. The systems are rule-based algorithms that provide scores for specific mutations or combinations of mutations. The scores are then translated into different levels of susceptibility. The rules for these scores are based on the literature and expert opinion. The Rega system was the first to be validated in drug-experienced patients [8,9], followed by ANRS [5,9] and Stanford [9].
A good way to compare systems is to correlate their predictions with virological response data. However, some systems may be better at predicting short-term virological outcomes, and others at predicting longer-term outcomes. The results of a comparison between systems may therefore depend on the virological outcome time point that is used. In this study, a large data set of HIV-1 sequences from patients was collected together with virological data to compare the three most commonly used interpretation systems with respect to genotypic susceptibility score and prediction of virological response. We used three different virological outcome time points to analyze the effect of therapy duration on the predictions of the systems.

Study population
Data were made available through the EU-sponsored ViroLab and EuResist projects [10,11,12]. The ViroLab project comprises data from Belgium (Katholieke Universiteit Leuven), Italy (University of Brescia and Catholic University of the Sacred Heart of Roma), Spain (IrsiCaixa Badalona), and The Netherlands (Erasmus Medical Centre Rotterdam). The EuResist project consists of data from Italy (ARCA database; http://www.hivarca.net/), Germany (AREVIR database), Sweden (Karolinska Infectious Diseases and Clinical Virology Department), and Luxembourg (Retrovirology Laboratory, CRP-Santé). The therapies available in the ViroLab and EuResist databases were given between 1996 and 2008.
These databases were used to extract treatment change episodes (TCEs). TCEs were defined, in patients aged ≥18, as follows (figure 1): (1) a baseline genotype (reverse transcriptase and protease region) and viral load (detectable being >50 copies/ml) obtained within 90 days before and 8 days after treatment change; (2) at least one follow-up viral load measurement at 12 (range: 8-16), 24 (16-32), or 48 (44-52) weeks; (3) no changes in therapy between the time of the baseline viral load and the follow-up viral load measurement. If more than one genotypic test or viral load measurement was performed within an analyzed treatment period, the value closest to the start of therapy or to the follow-up measurement time was used.
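The time windows in the TCE definition can be sketched in a few lines of Python. This is only an illustration, not the extraction code used for the ViroLab and EuResist databases: the function names, the tuple layout of the measurements, and the choice to treat window boundaries as inclusive are assumptions (the text gives the week-12 and week-24 windows as 8-16 and 16-32, so a measurement at exactly 16 weeks falls in both under this reading).

```python
from datetime import date

# Follow-up windows in weeks, taken from the TCE definition: target -> (low, high).
FOLLOW_UP_WINDOWS = {12: (8, 16), 24: (16, 32), 48: (44, 52)}


def baseline_in_window(sample_date, change_date):
    """True if a baseline sample lies within 90 days before to 8 days after the change."""
    offset = (sample_date - change_date).days
    return -90 <= offset <= 8


def pick_follow_up(change_date, measurements, target_week):
    """Return the (date, copies_per_ml) pair closest to the target week, or None.

    `measurements` is a list of (date, copies_per_ml) tuples (hypothetical layout).
    Window boundaries are treated as inclusive, which is an assumption.
    """
    low, high = FOLLOW_UP_WINDOWS[target_week]
    target_days = target_week * 7
    in_window = [
        m for m in measurements
        if low * 7 <= (m[0] - change_date).days <= high * 7
    ]
    if not in_window:
        return None
    return min(in_window, key=lambda m: abs((m[0] - change_date).days - target_days))


# Hypothetical episode: treatment change on 10 January 2005.
change = date(2005, 1, 10)
loads = [(date(2005, 3, 14), 400), (date(2005, 4, 4), 49), (date(2005, 7, 4), 49)]
week12 = pick_follow_up(change, loads, 12)  # measurement taken 84 days after the change
week48 = pick_follow_up(change, loads, 48)  # None: nothing between weeks 44 and 52
```

In this made-up episode two measurements fall inside the week-12 window (days 63 and 84); the one at day 84 is kept because it is closest to the 12-week target, mirroring the "closest value" rule in the definition.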

Interpretation systems and genotypic susceptibility scores (GSSs)
The genotypic results were interpreted using three commonly used rule-based interpretation systems: Agence Nationale de Recherches sur le SIDA (ANRS) version 17; Stanford HIVdb, version 5.1.2; and Rega Institute version 8.0.1. ANRS and Rega both report 3 levels of resistance: susceptible, intermediate, and resistant. For ANRS, we translated the definitions 'susceptible', 'intermediate', and 'resistant' into susceptibility scores of 1, 0.5, and 0, respectively. For the Rega scores, we used the weighted score suggested by Rega, which makes the following changes: NNRTIs were scored 0.25 for intermediate resistance (with the exception of etravirine, which was scored 0.5), and ritonavir-boosted PIs were scored 0.75 and 1.5 for intermediate resistance and susceptible, respectively. The Stanford algorithm uses 5 levels of resistance. We assigned scores of 0, 0.25, 0.50, 0.75, and 1 to high-level resistance, intermediate resistance, low-level resistance, potential low-level resistance, and susceptible, respectively. In a separate analysis we used unweighted scores for Rega, assigning the scores 0, 0.5, and 1 to the 'resistant', 'intermediate', and 'susceptible' groups, respectively, for all drugs. None of the three systems included a score for ritonavir. We therefore excluded eleven TCEs that used ritonavir as the only protease inhibitor, as we could not calculate a GSS for their treatment regimens.
The arithmetic sum of the individual scores for the specific drugs provided the total GSS of that treatment. For brevity, we classified the total GSS into the following categories: 0 to <1, 1 to <2, 2 to <3, 3 to <4, and ≥4. The 0 to <1 group contains viral sequences almost entirely resistant to the drugs in their regimen, and the ≥4 group contains viral sequences susceptible to more than 3 drugs given in their regimen.
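As an illustration of the scoring described above, the following Python sketch sums per-drug scores into a total GSS and assigns the reporting category. The score tables follow the ANRS and HIVdb values given in the text; the function names, the dictionary layout, and the example regimen with its resistance calls are hypothetical, and the weighted Rega scheme (which depends on drug class) is left out for brevity.

```python
# Per-drug scores as described in the text for each system's resistance levels.
ANRS_SCORES = {"susceptible": 1.0, "intermediate": 0.5, "resistant": 0.0}
HIVDB_SCORES = {
    "high-level resistance": 0.0,
    "intermediate resistance": 0.25,
    "low-level resistance": 0.5,
    "potential low-level resistance": 0.75,
    "susceptible": 1.0,
}


def total_gss(levels, score_table):
    """Sum per-drug scores; `levels` maps drug name -> reported resistance level."""
    return sum(score_table[level] for level in levels.values())


def gss_category(gss):
    """Map a total GSS onto the five reporting categories used in the text."""
    if gss >= 4:
        return ">=4"
    lower = int(gss)  # floor, valid because scores are never negative
    return f"{lower} to <{lower + 1}"


# Hypothetical ANRS calls for one regimen (not taken from the study data).
regimen = {
    "lamivudine": "resistant",
    "zidovudine": "susceptible",
    "lopinavir/r": "intermediate",
}
gss = total_gss(regimen, ANRS_SCORES)  # 0 + 1 + 0.5 = 1.5
category = gss_category(gss)           # "1 to <2"
```

The same genotype can land in different categories under different systems, because HIVdb distinguishes five levels where ANRS distinguishes three.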
To calculate the prevalence of drug resistance we used the mutation list published by the International AIDS Society USA (IAS-USA) [13].

Statistical analysis
Kaplan-Meier curves were estimated to determine the association between GSSs and the proportion of TCEs having an undetectable viral load (<50 copies/ml). The association between GSS and undetectable viral load was analyzed with logistic regression. In the multivariate analyses we adjusted for real time to viral load measurement (i.e. the number of days between the start of the TCE and the follow-up viral load measurement) and log viral load at start of therapy. Furthermore, we used logistic regression to calculate odds ratios for each GSS group compared to the GSS group of 0 to <1. Receiver operating characteristic (ROC) curves were calculated to analyze the trade-off between the proportion of true-positive (correct virologic response prediction) and false-positive (incorrect virologic response prediction) results across the range of possible prediction cutoffs. The AUC (area under the curve) is a value between 0 and 1 that corresponds to the probability that a randomly selected virologic success receives a higher score than a randomly selected virologic failure. We used the AUCs to quantify how well the systems separate TCEs with an undetectable viral load (<50 copies/ml) from those without. Robust extra-sample error estimation was obtained by 10-fold cross-validation [14]. We compared the results of multiple independent runs of the 10-fold cross-validation with a Kruskal-Wallis test. Analyses were performed with the SPSS software package (version 15.0 for Windows, SPSS).
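The probabilistic reading of the AUC given above can be computed directly by comparing every success with every failure. The Python sketch below is an illustration only (the analyses in this study were run in SPSS); the function name and the example scores are hypothetical, and counting ties as one half is an assumption that follows the usual Mann-Whitney convention, which matters here because GSS values are discrete and ties are frequent.

```python
def auc_from_scores(success_scores, failure_scores):
    """AUC as P(success score > failure score), counting ties as one half."""
    if not success_scores or not failure_scores:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for s in success_scores:
        for f in failure_scores:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5  # tie convention is an assumption (Mann-Whitney style)
    return wins / (len(success_scores) * len(failure_scores))


# Hypothetical GSS values for TCEs with and without an undetectable viral load.
successes = [3.0, 2.5, 2.0, 1.0]
failures = [2.0, 1.5, 0.5]
auc = auc_from_scores(successes, failures)  # 9.5 of 12 pairs, about 0.79
```

An AUC of 0.5 corresponds to a score that ranks successes and failures no better than chance, and 1.0 to a score that ranks every success above every failure.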

Baseline characteristics of the study population
The baseline characteristics are shown in Table 1. We included 3,131 patients in our study, of whom most were male (73%) and most were infected with subtype B viruses (81.9%); the median age was 39 years (range 18-78). Of the 3,131 patients, 476 (12.7%) had more than one TCE, which led to a total of 3,763 TCEs included in the study. Of these TCEs, 2,152 had a viral load measurement at week 12, 2,570 at week 24, and 1,083 at week 48. .08], and the median CD4+ cell count was 233 cells/µL (IQR, 120-371 cells/µL). The most commonly given drugs were lamivudine (59%), tenofovir (37%), and lopinavir (35%). The combination of lamivudine, zidovudine, and lopinavir/r was the most frequently given regimen (8%), followed by the combination of lamivudine, tenofovir, and lopinavir/r (6%).

Genotypic Susceptibility Score distribution
The genotypic susceptibility score for a TCE was calculated as the sum of the genotypic susceptibility scores for all drugs in the regimen, as explained in the methods. Figure 3 displays the proportions of cases in each susceptibility category, according to ANRS, HIVdb, and Rega. All systems show that at least three active drugs were started in a large proportion of TCEs. The mean GSS was slightly lower for HIVdb, with 1.92±1.17, compared to Rega and ANRS, with 2.22±1.09 and 2.23±1.05, respectively. The unweighted Rega scores did not differ much from the other scores, with a mean of 2.15±1.09.
The GSS of TCEs with longer follow-up were slightly higher compared to TCEs with a short follow-up time (data not shown), with baseline GSS means ranging between 1.93 and 2.23 at 12 weeks, 1.98 and 2.29 at 24 weeks, and 1.98 and 2.32 for TCEs with viral load measurement available at 48 weeks.

Prediction of virologic outcomes
The virologic responses of all TCEs are described in Table 2. The percentage of TCEs with an undetectable viral load (<50 copies/ml) was higher at week 24 than at week 12. Week 48 did not show a large increase in percentage compared to week 24. TCEs with a higher genotypic susceptibility score had a higher chance of reaching an undetectable viral load. At 48 weeks, the viral load became undetectable in more than 70% of the TCEs with a genotypic susceptibility score of ≥4.
Adjusted odds ratios for reaching a viral load below 50 copies/mL for each unit increase in GSS are reported in figure 4. These odds ratios were similar to the odds ratios obtained without adjusting for log viral load at start of therapy and real time to viral load measurement (data not shown). At all time points, the interpretation systems were significantly predictive of the virological response. Odds Ratios for each unit increase of the
The ROC curves in figure 5 depict different cut-off points for the three interpretation systems. In the table below the graph, the sensitivity, 1-specificity, and specificity are given for these cut-off points. The sensitivity and specificity of the ROC curves are similar for all systems. The calculated AUCs were around 0.63 at week 12 and 0.68 at weeks 24 and 48 (shown in Table 3). These AUCs did not differ significantly among the systems at any time point (with p-values ranging between 0.60 and 0.97). The AUCs of the unweighted Rega scores did not differ from those of the standard ANRS, HIVdb, and Rega scores, with means of 0.63 at week 12 and 0.68 at weeks 24 and 48 (data not shown).
In figure 6, Kaplan-Meier curves are given, showing clear associations between the GSS groups and the proportion of TCEs having an undetectable viral load. The GSS group of 4 or higher shows the highest proportion of TCEs having an undetectable viral load. The odds ratios of each GSS group are given in Table 4 for all time points. In the comparison between the different GSS groups and the GSS group of 0 to <1, odds ratios increased with increasing GSS. Odds ratios were higher at week 24 than at week 12 for all GSS groups and in all three interpretation systems, whereas the results at week 48 did not differ much from those at week 24. Due to the low numbers of included TCEs in GSS group ≥4 and at week 48, wide confidence intervals were seen in these groups.
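Why small groups produce wide confidence intervals can be seen from a simple unadjusted odds ratio with a Wald interval, sketched below in Python. This is a simplification and not the adjusted logistic regression used in the study; the function name and all counts are hypothetical and chosen only so that both comparisons have the same odds ratio.

```python
import math


def odds_ratio_with_ci(events_group, n_group, events_ref, n_ref, z=1.96):
    """Unadjusted odds ratio of a group versus a reference, with a Wald 95% CI."""
    a = events_group              # undetectable in the group of interest
    b = n_group - events_group    # detectable in the group of interest
    c = events_ref                # undetectable in the reference group
    d = n_ref - events_ref        # detectable in the reference group
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: same proportions, but the second comparison is ten times smaller.
large = odds_ratio_with_ci(300, 500, 100, 400)
small = odds_ratio_with_ci(30, 50, 10, 40)
```

Both comparisons give an odds ratio of 4.5, but the standard error of the log odds ratio depends on the reciprocal of each cell count, so the interval for the smaller comparison is considerably wider.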

Discussion
In this study, data from treated HIV-1 patients were used to compare how well the most commonly used interpretation systems predict virological outcome from genotypic drug resistance. We used logistic regression and AUC calculations and showed in 3,763 treatment change episodes that ANRS, HIVdb, and Rega do not differ in predicting virological outcomes.
Comparisons of interpretation systems have been reported previously [9,10,15,16,17]. In this work, due to the large study population, we were able to compare genotypic susceptibility scores between patients using many different drug therapy combinations and to control for important possible confounders. The results of our study were in agreement with previous findings [10,16]. In addition to previous work, our study has looked extensively at the differences between the predictive ability of the systems at different time points. We included both short-term responses (week 12) and longer-term responses (weeks 24 and 48). An explanation for the findings in this study is that the systems all make use of the same literature available on correlations between genotypic and phenotypic analyses as well as correlations with treatment history and clinical response.
Several studies showed small differences in genotypic susceptibility scores between different systems. For example, Ravela et al. [18], who compared 4 different interpretation systems (including ANRS, HIVdb, and Rega), reported 4.4% complete discordance, with at least 1 system assigning susceptible and another system assigning resistant; 29.2% partial discordance; and 66.4% complete concordance. However, in this study we found that these differences do not have a large influence on the prediction of the virological outcome of treatment.
A possible limitation of studies comparing different interpretation systems lies in the translation of the output of the interpretation systems into numeric values, which are chosen arbitrarily. However, we have used the same principles used by the authors of HIV drug-resistance algorithms for calculating the genotypic susceptibility score. Therefore we were able to compare the three systems in the way they are used in practice. We also used the Rega scores without the suggested weighting of scores for boosted PI drugs and NNRTIs. Using these unweighted scores did not change the GSS distributions or the prediction of virological outcome to a great extent.
Some novel drugs (etravirine, darunavir, tipranavir) were not frequently used in our study population. Similarly, drugs belonging to the newly approved classes, such as raltegravir and maraviroc, were not included. Therefore, the predictive value we found is not a validation of all individual rules in the systems, and we did not attempt to validate individual rules. Continuous validation in large datasets with recent drug data will therefore remain necessary.
No restriction on therapies was performed; therefore suboptimal regimens (fewer than three full-dose drugs) were included. However, the group of patients receiving suboptimal regimens was small and the same for all three interpretation systems. Furthermore, it was previously demonstrated that removal of suboptimal treatment reduces the accuracy of the models [19].
There has been much discussion about which follow-up period is most suitable to validate a system. Short-term responses might be more directly attributable to antiviral drug activity, whereas longer-term outcomes might be more clinically relevant but more easily confounded by other issues such as loss of adherence, drug discontinuations, and switches [20]. In our study, less than one third of all cases remained at the 48-week time point. This loss to follow-up creates selection bias in this group. Therefore, the 48-week group may not be representative of the whole study population. Patients who remain on therapy until the 48th week after start of therapy are likely to do better on therapy and to have better virological responses than patients who switch to another therapy at earlier stages. In accordance, we found stronger associations between interpretation systems and virological outcomes at later time points compared to earlier time points in the logistic regression analyses. However, in the logistic regression that compared the different GSS groups to the GSS group of 0 to <1, the odds ratios were similar between week 24 and week 48. Therefore, week 24 may be a suitable time point to measure long-term responses. However, confidence intervals at week 48 were wide because of the low numbers of included TCEs, which makes the estimates at this time point less precise.
In conclusion, we found that the three most commonly used interpretation systems do not differ in their ability to predict virological response. Also, when looking at different time points, the predictive abilities of the systems were similar. Since the overall performance is comparable, these systems might evolve towards a more consistent scoring in the future. New breakthroughs might be needed for further improvement in genotypic resistance test interpretation.