Antiretroviral Therapy Optimisation without Genotype Resistance Testing: A Perspective on Treatment History Based Models

Background Although genotypic resistance testing (GRT) is recommended to guide combination antiretroviral therapy (cART), funding and/or facilities to perform GRT may not be available in low to middle income countries. Since treatment history (TH) impacts response to subsequent therapy, we investigated a set of statistical learning models to optimise cART in the absence of GRT information. Methods and Findings The EuResist database was used to extract 8-week and 24-week treatment change episodes (TCE) with GRT and additional clinical, demographic and TH information. Random Forest (RF) classification was used to predict 8- and 24-week success, defined as undetectable HIV-1 RNA, comparing nested models including (i) GRT+TH and (ii) TH without GRT, using multiple cross-validation and area under the receiver operating characteristic curve (AUC). Virological success was achieved in 68.2% and 68.0% of TCE at 8- and 24-weeks (n = 2,831 and 2,579), respectively. RF (i) and (ii) showed comparable performances, with an average (st.dev.) AUC 0.77 (0.031) vs. 0.757 (0.035) at 8-weeks, 0.834 (0.027) vs. 0.821 (0.025) at 24-weeks. Sensitivity analyses, carried out on a data subset that included antiretroviral regimens commonly used in low to middle income countries, confirmed our findings. Training on subtype B and validation on non-B isolates resulted in a decline of performance for models (i) and (ii). Conclusions Treatment history-based RF prediction models are comparable to GRT-based for classification of virological outcome. These results may be relevant for therapy optimisation in areas where availability of GRT is limited. Further investigations are required in order to account for different demographics, subtypes and different therapy switching strategies.


Introduction
Current HIV-1 guidelines recommend genotypic resistance testing (GRT) both before starting antiretroviral therapy (ART) and at treatment failure. However, appropriate funding and/or facilities to perform GRT may not be available in low to middle income countries (LMIC), leaving physicians with switching therapy based solely on the patient's clinical background. Coupled with virological and immunological measurements, treatment history (TH) is one of the most crucial factors to play a role in the response to a new treatment regimen. In fact, current (2010) recommendations of the International AIDS Society-USA Panel [1] state that treatment history should be considered when designing a new regimen.
In LMIC there is still need for cheap viral load tests to identify early viral failures and limit the emergence of resistance [2]. In the past five years there has been an increase in HIV/AIDS surveillance, but there is still a general lack of rational collection and utilisation of the data as well as of appropriately trained medical staff [3,4,5]. Specific approaches for ART management have been designed by the World Health Organization [6]. Public health programs leading to earlier HIV diagnosis and initiation of ART are expected to improve patient outcomes in LMIC [7].
It has been shown that drug costs are a major factor impairing optimal HIV-1 drug sequencing [8]. Access to second-line ART regimens in LMIC is problematic, mainly because of the expense of HIV protease inhibitors (PI) [9], and CD4 monitoring is often the sole surrogate marker available to guide the management of the infection [10]. However, immunologic criteria to predict which patients have not achieved virological suppression results in significant misclassification of treatment responses [11]. Attempts to model the virological efficacy of ART in LMIC have been proposed by Colebunders et al. [12].
Given the need for low-cost viral load assays, evaluations of alternative technologies have been carried out [13,14]. The costs for genotype resistance testing (GRT) also remain prohibitively high for most LMIC, with only a few facilities being able to support the expenses, execute the tests and exploit the results [15]. New relatively cheap technologies for GRT and non-B subtypes analysis have also been proposed [16], but these have not yet been propagated sufficiently. An additional challenge is the genetic diversity of HIV-1, since most infected patients in developing areas harbour non-B subtypes that could respond to therapy differently from subtype B, the clade which has been extensively treated in Western countries [17,18].
The virological benefit of genotype-guided treatment decisions has been convincingly demonstrated in the last decade [19,20,21]. Although investigated to a lesser extent, genotype-guided treatment decisions were shown to be superior to those based on the medical history [22]. Accordingly, there is a plethora of GRT-based decision systems either for single drug susceptibility scoring or combined ART (cART) optimisation, from rule-based [23,24,25] to machine learning approaches [26,27,28,29,30,31,32]. On the other hand, attempts to guide treatment decisions using the TH information alone have been exploited only recently, using artificial neural networks, with extremely promising results [33].
In this work we aimed at investigating a set of statistical learning models -called random forests -for optimisation of antiretroviral therapy in the absence of GRT information, using the treatment history as a surrogate predictor. Furthermore, we compare the performance of TH-based to those of GRT-based models.

Ethics statements
This study uses anonymous retrospective data from European merged study cohorts and biological material was not employed in any step of the analysis. All the single data providers had previously obtained patients' informed consent for the execution of retrospective studies and their inclusion in merged cohorts, accomplishing national and international ethical issues. The study design was eventually approved by a global scientific committee and by local scientific committees of each single data provider.

Study design
The EuResist retrospective database (http://www.euresist.org/) was used to extract patients' treatment change episodes (TCE), previously introduced by the HIV Resistance Response Database Initiative [31] and by the Forum for Collaborative HIV Research [34], in the so-called form of standard datum (SD).
Each SD was defined as a patient's new regimen (TCE), either a first-line or a subsequent line of therapy, coupled with a follow-up HIV-RNA measured at 8 weeks (ranging from 4 to 12) and at 24 weeks (ranging from 20 to 28) of unmodified therapy.
TCE included antiretroviral compounds approved by the Food and Drug Administration and by the European Medicine Agency: the nucleoside/nucleotide reverse transcriptase inhibitors (NRTI) lamivudine, abacavir, zidovudine, stavudine, zalcitabine, didanosine, emtricitabine and tenofovir disoproxil fumarate; the non-NRTIs (NNRTI) efavirenz, nevirapine and etravirine; the protease inhibitors (PI) amprenavir/fosamprenavir, atazanavir, indinavir, lopinavir, nelfnavir, full-dose ritonavir, boosting-dose ritonavir, saquinavir, tipranavir and darunavir; the fusion inhibitor enfuvirtide. Each drug was represented by a binary variable encoding its presence/absence in a TCE. No restrictions concerning the number of drugs included in a regimen were applied. Both suboptimal treatment regimens made of ,3 drugs or salvage regimes with .4 drugs were included. It was previously shown that the inclusion of suboptimal TCE can improve data-driven model performance [31]. The number of drugs in the regimen was considered as a numerical variable.
Each TCE was provided with a contemporary HIV-1 genotype (spanning protease and reverse transcriptase genes) or the closest within 90 days before the TCE start date. The viral genotype was encoded as binary vector of mutations, insertions and deletions with respect to the HIV-1 consensus B reference, and the viral subtype was determined by a BLAST search on the latest subtype reference set provided by the Los Alamos repository (http://www. hiv.lanl.gov/).
The closest (e.g. baseline) HIV-RNA and CD4+/CD4% cell count measurements previous to the TCE start date (obtained at most 90 days before, provided that no other therapies were started and stopped during this time window) were collected. Patient's demographic information was also retrieved, including age, gender, mode of HIV transmission (drug user, heterosexual, homosexual/bisexual, blood products, mother-to-child transmission), and country of origin.
We associated to each TCE the previous patient's drug exposures, codifying a binary variable for each compound if that drug was experienced for more than 12 months. Three additional binary variables were defined summarising the exposures to the NRTI/NNRTI/PI classes, and a numerical variable representing the regimen line (any drug change in a combination therapy for any reason). We defined a virological success at 8-weeks, as the achievement of an undetectable HIV-RNA or a decrease of at least 2 Log 10 from the baseline HIV-RNA. At 24-weeks, the virological success was defined as the achievement of an undetectable viral load. Because of the inclusion of many viral load data obtained with non-ultrasensitive assays, a 500 cp/ml threshold was applied for the definition of undetectable HIV-RNA.
Additional details for the study design, the SD definition and data extraction procedures and constraints have been discussed previously [30,31,34]. For simplicity, we will refer to the two defined 8-week and 24-week SD sets, containing viral genotypic information, as SD8G and SD24G, respectively. Of note, these TCE correspond exactly to those used in [30], in order to maintain a fair comparison with other tested methods, such as expert rule bases. Indeed, the EuResist DB is periodically updated from the local sources (currently Italy, Germany, Sweden, and Luxembourg) participating to the network.
Two other data sets were extracted using the same notion of SD, except for relaxing the need of a baseline GRT, and including additional data from the Virolab study group (http://www. virolab.org/), comprising the cohorts of Belgium (Katholieke Universiteit Leuven) and Spain (fundaciò irsiCaixa). These two sets, containing treatment history but not mandatorily baseline GRT, were named SD8H and SD24H, respectively.

Statistical methods
Random Forests (RF) were used to predict 8-and 24-week outcomes and to assess variable importance [35]. RF are a nonlinear statistical learning methodology for classification and regression. They ensemble several decision trees (usually from hundreds to thousands) and give predictions by taking either the majority vote or the average of the single trees' outputs. Specifically, each single tree is fully grown on a bootstrap sample of the training data, without pruning the leaves, and at each node split only a subset of the variables is considered. RF present many advantages with respect to other non-linear machine learning methods, such as neural networks, since they usually yield high performance, are robust to over-fitting, can handle a large number of variables, and provide a measure of variable importance.
In our study, the tree number and number of candidate variables at each split of RF were optimised preliminarily with a bootstrap approach. The area under the receiver operating characteristic curve (AUC) [36] was adopted to evaluate the model fit. The AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, whereas a receiver operating characteristic (ROC) plot provides a graphical evaluation of true-positives (sensitivity) versus false-positives (1-specificity) tradeoffs. Multiple 10-fold Cross Validation (CV) and Bengio's corrected t-test [37,38], controlling the false discovery rate with Benjamini-Hochberg method, were used to compare model performance.
The following nested models were defined: (i) full set of input covariates (including GRT and TH); (ii) full set of input covariates, excluding GRT information, but keeping TH; (iii) full set of input covariates, excluding TH, but keeping GRT; (iv) current cART, baseline HIV-1 RNA and CD4; (v) current cART.
We also designed a set of sensitivity analyses as follows: in order to investigate the potential bias derived from the fact that most of the TCE present in the data base have been decided after a GRT, we excluded from SD8H/SD24H all TCEs for which a baseline GRT was available, and we compared models with/without baseline HIV-RNA load as a covariate. From this reduced data set, we selected first-and second-line regimens commonly available in LMIC (TCE containing enfuvirtide, tipranavir, darunavir, and etravirine were removed) and repeated model evaluations. Finally, SD8G was split into two sets according to the viral subtype of each associated GRT, grouping B and non-B subtypes. RF models were trained on the subtype B dataset and tested against the non-B.
The analyses were carried out using R, open-source software for statistical computing [39], and Weka, a data-mining suite [40].
Relaxing the need for a baseline GRT, the data sets SD8H and SD24H were extracted (n = 12,932 for both outcome points). Seventy-one percent of patients reached virological success at 8weeks and 67% of patients reached virological success at 24-weeks. Differences in proportions of SD8H and SD24H with SD8G and SD24G success rates yielded p,0.0001 and p = 0.13, respectively. The percentage of suboptimal therapies was 8%. The majority of cART contained lamivudine (63%), zidovudine (33%), tenofovir (28%), stavudine (27%), any PI/r (29%, where lopinavir/r accounted for 23%), didanosine (21%), abacavir (19%), and efavirenz (11%). There was a low percentage of darunavir (1.3%) and tipranavir (1.3%) containing cART. Comparison between GRT-based and TH-based RF models Table 2 and Figure 1 show detailed performance results and ROC plots for SD8G and SD24G, comparing RF models (i) through (v), under multiple 10-fold CV. Model (i), which includes the full set of covariates (including both GRT and TH), was the best performing in terms of AUC for both time points. However, model (ii) (TH but not GRT) had AUC distributions not significantly different from model (i) at 8-weeks (p = 0.25) although inferior at 24-weeks (p = 0.04). The same held for model (iii) (GRT but not TH), which was comparable to (i) at 8-weeks (p = 0.28) and inferior at 24-weeks (p,0.0001). On the other hand, models (iv) and (v) were always significantly outperformed by models (i) to (iii) at any time point (all p,0.0001). Model (ii) and (iii) did not show significant differences in AUC at both time outcomes (p = 0.7 and 0.113).
We also compared AUC of models (i), (ii), and (v) by stratifying for the therapy line (first-, second-, third-, fourth-line or more). Detailed results are shown in Table 3. Interestingly, both model (i) and (ii) show significantly better performance as compared to the base cART model (v) only at late switches. Figure 2 (panel a) depicts the AUC for each model and therapy strata over a single 10-fold CV run using the SD8G data set, and the proportion of virological successes for each therapy line (panel b).

Sensitivity analyses
Since all the extracted data sets may be biased due to the fact that in Europe the prevalence of genotype-guided TCE is high, we excluded from SD8H and SD24H all the TCE for which there was a baseline GRT available (obtaining n = 9,623). This is meant to be an approximation to the set of non-genotype-guided therapies, although some baseline GRT may have been available but not inserted in the data base. In a sub-analysis, then, we deleted the HIV-RNA variable. The proportion of successes was 71% and 65% at 8-and 24-weeks, respectively. AUC plots for the sensitivity analyses, along with variable importance evaluation on the SD8H data set, are available as supplementary material ( Figure S1, Figure S2 and Material S1).

Discussion
Following the demonstration of a major role for HIV-1 drug resistance mutations in response to antiretroviral therapy, GRT has been the standard decision tool to face HIV-1 drug resistance in clinical practice. The caregiver can use one or multiple systems developed to translate HIV-1 genotype into clinically relevant information. Commonly used genotype interpretation systems do not integrate additional patient information as an input to complement HIV-1 genotype in the effort to predict response to cART. A few experimental systems derived from statistical learning have recently been described which integrate patient and virus information to build an effective antiretroviral regimen [30,31,41]. These systems confirmed that TH has an impact on prediction of response to therapy.
The expanding cART coverage of LMIC poses new challenges for optimising treatment due to limited availability of second-line and later regimens and frequent lack of facilities or funding to support GRT. Our analysis suggests that availability of basic patient data supplemented with simple categorical indicators of past treatment can provide short-and medium-term response prediction as accurately as in the presence of HIV-1 genotype information. This finding is not surprising since the virus genotype is actually shaped by anti-HIV-1 compounds and can thus summarize the patient TH. It is interesting to note that the performance of the GRT-based model, which did not include any information on TH, decreased, although not significantly, at 24-weeks, compared to that of TH-containing models. This may be a consequence of the fact that bulk genotyping does not capture minority variants that have been selected by previous exposures [42] or resistant variants that survived in a reservoir and will be quickly reselected. It should be worth evaluating whether using the cumulative or historical genotype, made by the sum of all the available genotypes, improves accuracy at a later time point [43,44].   Application of ultra-deep sequencing procedures could also provide high-sensitivity genotype data [45] possibly resulting in better prediction of response to therapy. We also found that both the RF models with GRT+TH and the TH alone increase their performance in predicting the virological outcomes when considering subsequent therapy lines. This suggests that the outcomes of the first/second lines are somehow less dependent on the viral strain or on the early ART exposures, while the evaluation of TH and the execution of a GRT boost up the confidence in predicting the correct outcome at later ART stages. In principle, continuous measures of exposure to therapy, i.e. corrections based on duration and last time of use, could improve the power of TH as a covariate. However, such detailed information is often hard to obtain, particularly in the absence of centralised (electronic) medical records. The simplicity of the input information should expedite further training of TH-based systems with data derived from these areas. In this context, the data required for querying a TH-based expert system should be kept simple in order to encourage its use. The input dataset investigated in this work was made of demographic data typically available at any HIV clinic (patient age, gender and route of infection), baseline markers (CD4 cell count and HIV-RNA load), just coupled with binary indicators for past use of individual drugs (and derivatively of drug classes). We showed that when the HIV-RNA covariate is deleted the AUC performance decrease, which is an argument in favour of the use of viral load monitoring for the optimization of cART in LMIC. This reconciles with another recent study from Revell et al. [33], that explored either artificial neural networks or RF for the prediction of virological outcomes in absence of GRT information, using data from Europe, North America, Japan and Australia (.3,000 TCE). Revell et al. showed that, besides the GRT, the  viral load is the most important variable, and that TH improves the accuracy of the prediction of virological outcomes. In a sub-analysis, we showed that the accuracy of the prediction models decreased when training on HIV-1 subtype B data and validating on non-B subtype data. While non-B grouping is an artificial approach not corresponding to any biological entity, differences in response to certain cART regimens with specific subtypes may occur and must be considered in the development of treatment decision tools.
A limitation of our study is that the datasets used do not correspond to a typical scenario of LMIC, since models were applied on data composed by patients cared in Europe. However, we performed sensitivity analyses excluding GRT-guided TCE, and including only first-and second-line regimens with drugs available in LMIC. But a more detailed investigation is advisable since the outcome distributions varied significantly across different datasets. Although the compounds considered in these analyses can be accessible in LMIC, drug combinations might differ.
In addition, in LMIC additional factors like co-morbidity with other diseases, cost of treatment, distance to treatment centres, interruptions of stocks for drugs during certain periods, and stigma can play an important role in treatment outcome. This information needs to be integrated in a system specifically designed for LMIC, provided that co-operation efforts for data collection are appropriately set up.
Another critical point is the fact that therapy switches in EuResist are mainly driven by virological monitoring, and not by clinical/immunological criteria, that are commonly used in LMIC. Finally, in this study a 500-copy threshold in the definition of undetectable viral load was used. This was due to the inclusion of HIV-1 RNA data derived from old generation laboratory assays: although this is a strong limitation for a model designed for highincome countries, where the virological reduction below 50 cp/ml would be the necessary outcome, in LMIC the end-point might be revised by considering the HIV-RNA only where testing is available and CD4/clinical monitoring otherwise.
Despite these limitations, prediction of response to treatment based on TH rather than on GRT appears to be an appealing strategy providing a possibility to help clinicians with data-driven systems in the absence of HIV-1 genotype information. We realize that the here described model is only applicable when therapy failures are mainly judged from viral load monitoring, and when further lines of treatment are available, which is currently not the case in LMIC. However, the concept of the model warrants optimism towards the development of more appropriate models for LMIC. Thus, further development along these line are warranted, along with a coordinated effort to collect HIV-1 treatment related data from the areas that could maximally benefit from it.

Supporting Information
Material S1   Figure S2 ROC plots of a single 10-fold CV run for RF models on SD8H and SD24H excluding instances with an available baseline GRT (n = 9,623) and with/without baseline HIV-RNA load as a covariate.