The Individualized Genetic Barrier Predicts Treatment Response in a Large Cohort of HIV-1 Infected Patients

The success of combination antiretroviral therapy is limited by the evolutionary escape dynamics of HIV-1. We used Isotonic Conjunctive Bayesian Networks (I-CBNs), a class of probabilistic graphical models, to describe this process. We employed partial order constraints among viral resistance mutations, which give rise to a limited set of mutational pathways, and we modeled phenotypic drug resistance as monotonically increasing along any escape pathway. Using this model, the individualized genetic barrier (IGB) to each drug is derived as the probability of the virus not acquiring additional mutations that confer resistance. Drug-specific IGBs were combined to obtain the IGB to an entire regimen, which quantifies the virus' genetic potential for developing drug resistance under combination therapy. The IGB was tested as a predictor of therapeutic outcome using between 2,185 and 2,631 treatment change episodes of subtype B infected patients from the Swiss HIV Cohort Study Database, a large observational cohort. Using logistic regression, significant univariate predictors included most of the 18 drugs and single-drug IGBs, the IGB to the entire regimen, the expert rules-based genotypic susceptibility score (GSS), several individual mutations, and the peak viral load before treatment change. In the multivariate analysis, the only genotype-derived variables that remained significantly associated with virological success were GSS and, with 10-fold stronger association, IGB to regimen. When predicting suppression of viral load below 400 cps/ml, IGB outperformed GSS and also improved GSS-containing predictors significantly, but the difference was not significant for suppression below 50 cps/ml. Thus, the IGB to regimen is a novel data-derived predictor of treatment outcome that has potential to improve the interpretation of genotypic drug resistance tests.


Introduction
Despite an increasing arsenal and improved potency of antiretroviral drugs, the optimal use of combination antiretroviral therapy against HIV-1 infection remains challenging [1]. Complicating factors include drug interactions and toxicities, adherence to therapy, and development of drug resistance [2]. Because genotypic drug resistance testing is performed on a routine basis today and because mutational patterns are unique for each patient, treatment choices are, in principle, highly personalized. In practice, however, it can be difficult to identify an optimal drug combination for each individual patient due to the combinatorial complexity of both the set of feasible drug combinations and of viral mutational patterns.
In addition to controlled clinical trials, analyzing data from large observational cohort studies is a promising way to identify predictors of treatment outcome, even if the availability of drugs and therapeutic strategies change over time [3]. This approach can be based on modeling the risk of acquiring additional mutations [4], on estimating future drug options [5], on predicting the time to virological failure [6,7], or on classifying the regimens of treatment change episodes (TCEs) as successful versus failing, depending on the patient's response to therapy. A TCE consists of predictor variables including the applied drug combination, viral genotype, treatment history, demographic and clinical parameters, and a response variable such as the change in viral load.
Drug resistance development is driven by viral evolution and thus models of viral evolutionary escape from drug pressure have been proposed to improve therapy response prediction [16,22,36]. Specifically, the individualized genetic barrier (IGB) to drug resistance has been suggested as a predictor of treatment outcome. The IGB is defined as the probability of the virus not to become resistant to a certain drug [37][38][39]. A high IGB means that viral evolutionary escape from the selective pressure of the drug is unlikely. Related quantities are the average number of mutations and the average time to reach drug resistance derived from simulated HIV-1 evolutionary trajectories on an estimated fitness landscape [36,40,41]. This approach has been explored for treatment with zidovudine plus lamivudine and with nelfinavir [42], but it does not scale to the variety of combination therapies observed in clinical databases, because sufficient data for estimating fitness landscapes is available only for a few drug combinations. Earlier, the term 'calculated genetic barrier' has been used to assess the number of mutations necessary to acquire specific drug resistance-associated mutations, which were found to be similar among HIV-1 subtypes [43].
In the present study, we apply a simplified definition of the IGB which can be computed efficiently for any drug combination based on a statistical model that captures the order and the dynamics of accumulating mutations and the associated levels of phenotypic drug resistance [44]. The IGB to resistance to a certain drug is the probability that the virus will not accumulate additional mutations leading to a resistant strain. This drug-specific IGB has been demonstrated to be a strong predictor of virological response in two large observational cohort studies [26,28]. Here, we derive a novel predictor, the IGB to the entire drug combination which measures the genetic potential for evolutionary escape of the virus from the selective pressure of combination therapy.
In order to assess the performance of the IGB as a predictor of treatment outcome, we analyzed TCE data from the Swiss HIV Cohort Study (SHCS) database, a large, long-term observational, multi-center, clinical database with integrated results of genotypic drug resistance tests [45,46]. We identified risk factors of therapeutic failure and constructed models of treatment outcome considering as predictors the applied regimen, treatment history, viral genotype, GSS, drug-specific IGBs, IGB to regimen, and demographic and clinical variables including patient adherence. Overall, we found the IGB to the entire regimen to be the strongest and most significant predictor. Our results demonstrate that the viral genotype is represented efficiently by the IGB to regimen, a single, interpretable probability summarizing the predicted dynamics of viral evolutionary escape.

Results
For each drug, viral evolutionary escape from its selective pressure was modeled using Isotonic Conjuctive Bayesian Networks (I-CBNs). In these probabilistic graphical models, dependencies among mutations are described by a partial order, which defines the genotype lattice, i.e., the set of genotypes compatible with the order constraints, and hence the set of possible mutational escape pathways ( Figure 1). To each genotype, its level of phenotypic drug resistance is associated using isotonic regression, such that drug resistance is monotonically nondecreasing along any mutational pathway from the wild type towards the genotype carrying all mutations. Using cross-sectional matched genotype-phenotype pairs from the Stanford HIV Drug Resistance Database, I-CBN models were learned for a total of 18 antiretroviral drugs (Supporting Figures S4, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S21, Supporting Table S2). Each model includes up to eleven preselected mutations (see Methods).
From the I-CBN models, transition probabilities among genotypes were derived and the individualized genetic barrier (IGB) to resistance development to each drug was computed as the probability of the observed genotype not acquiring additional mutations that would transform it into a genotypic state predicted

Author Summary
Drug resistance remains a challenge in the management of HIV-infected patients. The accumulation of mutations during ongoing viral replication is the origin of drug resistance development. Understanding this evolutionary process in a quantitative manner is an important prerequisite for minimizing the risk of resistance development and for the optimal selection of drug combinations for each individual patient. We present probabilistic graphical models for describing the evolution of drug resistance, and we derive the individualized genetic barrier (IGB), a single quantity summarizing the genetic potential of the virus for evolutionary escape from selective drug pressure. The predictive power of the IGB is demonstrated on a large well characterized clinical cohort of HIV patients and compared to classical predictors.
to be resistant. For a drug combination, the IGB was obtained as the sum over all drugs of the regimen of the drug-specific IGBs. Thus, the IGB to regimen can be regarded as the expected number of active components in the drug cocktail taking viral evolutionary escape mechanisms into account. To assess the predictive power of the IGB in a clinical setting, we analyzed a large cohort of HIV-1-infected patients and compared the IGB to several known predictors of therapy response (Figure 2), including the GSS, obtained from the Stanford HIV Drug Resistance Database website (HIVdb 6.2.0).
TCEs from the time period 1988-2010 were derived from the SHCS database (Table 1 and 2) and labeled as either failure or success (see Methods). Therapy success was defined as viral load reduction below 50 cps/ml (400 cps/ml) during treatment. We obtained 2185 (2631) genotype-therapy pairs, including 73% (63%) failures. The usage of individual drugs and the 30 most frequent drug combinations are shown in Supporting Figures S2 and S3, respectively. The historical development of drug usage patterns is reported in Supporting Table S3, where the regimens are annotated as either being recommended as first-line or alternative regimens according to current treatment guidelines [47], or as past first-line or second-line recommended regimens that are still in use in developing countries or occasionally used if drug resistant virus is present at baseline or as salvage regimens, or as regimens that are not in use anymore as first-line regimens but were before, including those still used under special circumstances, such as unusual tolerability.
In order to predict the outcome (failure versus success) of each therapy, we considered applied drugs, demographic and clinical variables, viral genotype, IGBs to received drugs, and IGB to regimen ( Figure 2, Table S1). Univariate logistic regression resulted in a total of 50 (44) features that were significantly associated with therapy outcome ( Figure S22). Among the predictive drugs, the use of ZDV, d4T, 3TC, and NFV were  were used to learn I-CBN models for all drugs separately. The drug-specific individualized genetic barriers (IGBs) are derived from these models. The IGB to regimen is computed for each genotype-therapy pair in the Swiss HIV Cohort Database and its predictive power is assessed in prediction models that also account for classical demographic, clinical, and genetic covariates. doi:10.1371/journal.pcbi.1003203.g002 associated with increased risk of therapeutic failure, while ABC, TDF, FTC, EFV, RTV, LPV/r, ATV, and ATV/r increased the odds of therapeutic success. Most of the significant amino acid changes in the viral protease (PR) gene (10I, 30N, 33F, 46I, 54V, 71V, 82A, 84V, 90M) and reverse transcriptase (RT) gene (39A, 41L, 44D, 67N, 74V, 103N, 118I, 123S, 210W, 215Y, 297R) have been associated with resistance to multiple PR inhibitors (PIs) and RT inhibitors (RTIs), respectively, and all except PR 30N and RT 123S increased the risk of treatment failure. A higher IGB to any of 15 (16) individual drugs increased the chance of successful virological response. The IGB to the entire drug combination and the GSS were also significant predictors.
In the multivariate analysis, only 12 (14) variables were significant, nine (ten) of which are indicating the inclusion of individual drugs in the regimen ( Figure 3). The usage of the nucleoside RTIs (NRTIs) ZDV, ddI, d4T, and 3TC, and of the PIs APV and SQV, were associated with negative treatment outcome, whereas the four boosted PIs (i.e., given together with low-dose RTV to improve their bioavailability) SQV/r, IDV/r, LPV/r, and ATV/r had positive predictive power. Among the many genotype-derived predictors, only GSS and IGB to regimen reached statistical significance at the 1% level in the multivariate model. For the 50 cps/ml success definition, the odds ratio (OR) of therapeutic success was ten-fold higher for the IGB ( ), indicating that the IGB provides an effective summary of the risk of treatment failure due to viral genetic changes. In addition, increased overall maximum (peak) viral load before treatment remained a significant predictor of therapy outcome in the multivariate logistic regression model.
For optimal treatment outcome prediction, we also explored the use of regularized logistic regression models. Specifically, the elastic net, which combines L 1 and L 2 regularization was applied to identify sparse classifiers of therapy outcome. Classifier performance was evaluated in ROC curves summarized by the area under the ROC curve (AUC), and analyzed according to the historical drug usage patterns (Table S3).The competitive models (high AUC) are only those using all clinical and demographic variables, mutations, and drugs (Tables S5, S6, Figure 4). When comparing IGB to GSS as predictors in this setting, we found a significant advantage of the IGB for 400 cps/ml if all other features are included in the models (p~0:01, Wilcoxon rank sum test). Furthermore, the IGB also improves treatment outcome prediction if added to models that already contain the GSS (p~0:0002). For 50 cps/ml, we did not find significant differences in AUC between IGB and GSS when used in prediction models that included all other covariates, nor did the GSS-containing model improve upon adding IGB. The significant increase for the larger dataset with the 400 cps/ml success definition demonstrates the predictive power of the IGB and indicates that GSS and IGB, although correlated, contain some orthogonal information, which, if combined, can further improve treatment outcome prediction.  Table S5 and Table S6

Discussion
We have comprehensively analyzed factors of therapy outcome in the SHCS database using univariate, multivariate, and regularized multivariate logistic regression models. As predictors of therapeutic success we identified the applied drugs, the GSS, and as the strongest predictor the IGB to regimen, a novel predictor derived from viral genotype.
Including genotype information into treatment outcome prediction is challenging because of the large number of observed mutations and the complexity of the genotypephenotype relationship. Here, we have explored the IGB to drug resistance as a summary measure of the escape dynamics of the virus under treatment. The underlying idea of this modeling approach is that the IGB captures how difficult it is for the virus to escape from the selective pressure of individual drugs or from the entire drug combination. This piece of information is different from assessing the current genotypic or phenotypic drug resistance state of the virus, as intended, for example, by the GSS. The IGB makes a prediction about the expected escape dynamics of the virus population given its current genetic state.
The computation of the IGB involves an evolutionary model of genetic progression under selective drug pressure along multiple mutational pathways and a notion of evolutionary escape, which was based here on the predicted level of phenotypic drug resistance. We applied I-CBN models for jointly describing genetic progression and associated phenotypic change of the virus. In particular, phenotype predictions are non-linear in the mutations, which allows for capturing epistatic effects, i.e., the same mutation can have different effects on the resistance phenotype depending on the genetic background of the virus (Figure 1). The I-CBN models were estimated from independent genotype-phenotype data. Using these models, the complex, high-dimensional, genotypic data of each virus can be summarized efficiently by the IGB to resistance to each drug. Thus, rather than modeling interactions between drugs and individual mutations, the IGB provides a comprehensive model of drug-genotype interaction.
In the present study, we have extended the concept of the IGB to the entire regimen in a fashion that allows for computing this quantity for any drug combination and hence for large clinical datasets. The IGB to regimen can be regarded as the expected number of active drugs in the regimen. Assuming independent effects among drugs, we compute the regimen IGB from the drug IGBs. These simplifying assumptions are made for computational feasibility. They present a conceptual limitation of the approach and more elaborate models are conceivable. In addition, other variables not included in this study might be important, for example, pharmacological properties of drug combinations and host genetic factors. Here, the IGB, a single interpretable quantity, was found to be the strongest genotype-derived predictor of virological response and hence the most efficient representation of the viral genotype with respect to therapy outcome.
We have used throughout two definitions of virological success of treatment, namely reduction of viral load below 50 cps/ml and below 400 cps/ml. The latter less stringent cutoff was included because in the past it represented the limit of detection of viral load assays. Today viral load values of 50 cps/ml and lower can be measured and reduction below 50 cps/ml (or below the limit of detection) is an accepted therapeutic goal. We generally found very similar results for the two datasets, but the advantage of using IGB over GSS (the de facto standard genotype interpretation tool) reached statistical significance only for 400 cps/ml, but not for 50 cps/ml. This finding may, in part, be due to the larger dataset and hence increased statistical power for 400 cps/ ml as compared to 50 cps/ml. In the future, larger datasets will be required to further evaluate the IGB and its potential to predict treatment outcome without the need for expert rules. This property of the IGB is particularly appealing for new drugs, for which reliable rules are not readily available before evidence has accumulated in published studies. Larger datasets and more elaborate statistical variable importance methods [48] will also increase the power to detect other factors of therapeutic outcome, but the general consistency between the 50 cps/ml and 400 cps/ ml success definitions suggests that a sizable fraction of important variables have been identified. In addition, larger TCE databases will allow for analyzing alternative endpoints, such as time to virological failure or virological response after a fixed period of time.
In the univariate analysis, most drugs had a positive effect on treatment outcome, with the exception of ZDV, d4T, 3TC, and NFV. The negative associations might be due to the prominent use of the drug combinations (ZDV or d4T) +3TC+ (IDV or NFV), 90% of which were failures. The four drugs were among the first to be approved for antiretroviral therapy and used in early suboptimal regimens. Moreover, they were poorly tolerated and therefore one can expect a general lower adherence to treatment. A similar observation was made in the multivariate analysis, where ZDV, ddI, SQV, 3TC and d4T were significant predictors decreasing the odds of therapeutic success. This effect might also be due to the common early use of these drugs in mono therapy and their later use in salvage regimens, even if multiple resistance mutations had already accumulated [49]. Among PIs, a pronounced trend was that boosting with RTV increased the odds of successful treatment. The fraction of PI boosting in the dataset is reported in Supporting Table S4.
A few variables did not show significant association with therapy outcome although they might have been expected to. For example, adherence is a well-known predictor of treatment success [50,51], but it failed to reach significance in the multivariate model, most likely due to lack of adherence data for about 45% of the patients. The missing data resulted from collecting adherence data within the SHCS only since January 2003. Indeed, in a multivariate analysis restricted to the subset of 1183 TCEs with observed adherence a more pronounced effect can be observed. We have not included a set of variables in this study that are known to be predictors because of the construction of the dataset. The definition of the dataset of genotype-therapy pairs allows for including several sequential TCEs from the same patient. Most TCEs are actually derived from unique patients, but some patients occur multiple times. Each TCE gives rise to two therapy cases, a failure, which had given rise to the switch, followed by a salvage regimen, which can be a failure or a success. Therefore, we did not include variables that are affected by the sequential ordering of therapies, such as the total time a patient was under therapy with a certain drug or the calendar year of treatment.
In summary, the IGB to regimen is a new predictor of treatment outcome that captures, in a single quantity, the virus' genetic potential for developing drug resistance under the selective pressure of the combination therapy. The IGB can be computed efficiently for any viral genotype and any drug combination. It may thus contribute to improved interpretation of genotypic drug resistance tests and to the rational design of individualized therapies. Future prospective studies are required to apply these results to other patient populations and to eventually integrate them into clinical practice.

Swiss HIV Cohort Study (SHCS) database
Founded in 1988, the SHCS is a nationwide, prospective, multicenter, clinic-based cohort with continuous enrolment and semi-annual study visits representing approximately 50% of all HIV-infected and 75% of all treated patients in Switzerland [46]. The SHCS has been approved by ethical committees of all participating institutions, and written informed consent has been obtained from all participants. The SHCS drug resistance database contains the results of 13,201 genotypic resistance tests from 9,231 patients, stored in a central database [45]. Resistance data stem from routine clinical testing (60%) and from tests performed retrospectively from frozen repository plasma samples (40%) (Table 1 and 2).

Treatment change Episode (TCE) data
TCEs were obtained from the SHCS database as follows. Each TCE consists of a failing therapy followed by a salvage therapy (Supporting Figure S1). We required that the failing therapy was at least four month long and that the genotype was measured no more than 90 days before and no more than 30 days after onset of the uninterrupted salvage therapy [26]. In order to restrict to failing regimens due to viral rebound and to exclude convenience treatment changes or single determinations of low-level viremias (blips), a failing therapy was defined by either two consecutive viral load measurements above 500 cps/ml, or a single viral rebound followed by therapy switch, or single rebound after 180 days and lack of viral suppression below the limit of detection. Therapies were labeled 'success' versus 'failure' as follows. Any failing therapy was considered a failure. Salvage therapies were considered successful, if viral load dropped below 50 cps/ml at any time point during treatment, otherwise they were considered failures. Because viral load assays with a sensitivity of 50 cps/ml were not available for the whole observation period, we also considered an alternative definition of therapy success as a viral load reduction below 400 cps/ml. The TCE dataset spans the time period 1988-2010, but 75% of TCEs date from 2000 or later.

Isotonic Conjunctive Bayesian Network (I-CBN) models
Genetic progression of the virus under selective drug pressure and the resulting phenotypic drug resistance changes were modeled jointly using I-CBNs [44]. In this model, mutations occur subject to partial order constraints which define the genotype lattice, the set of genotypes compatible with the constraints, and drug resistance is non-decreasing along any mutational pathway (Figure 1). Formally (see [44] for details), let (E,[) be a partially ordered set of n mutations. Each genotype is identified with the subset g(E of mutations it carries. The genotype lattice G induced by (E,[) is the set of all genotypes g for which it holds that e[g implies e'[g whenever e'[e in E. We denote by Exit [ (g) the set of accessible mutations from genotype g under the given partial order constraints. The I-CBN is a statistical model for the random variables X [2 E^f 0,1g n , describing observed genotypes, and Y [R, describing associated drug resistance phenotypes, both of which are observed from true hidden genotypes Z[G(2 E subject to noise. The probability of an unobserved genotype Z is defined as where the parameters h e denote the conditional probabilities of mutation e[E given that all of its predecessor mutations have occurred, h e~P e[ZjVe 0 [Z, e 0 ve, e 0 =e ð Þ . The observed random variables X and Y are independent given Z. The genotype observation error is modeled as where d denotes the Hamming distance and errors are assumed to occur independently among sites at rate e. The observed drug resistance phenotype Y is the log fold-change in susceptibility. For each genotype Z, it follows a normal distribution subject to the monotonicity contraints m g ƒm h for all genotypes g(h. The complete model for X and Y is then the marginalization Parameter estimation for this model was performed using the EM algorithm described in [44]. The model was applied separately to 18 antiretroviral drugs, using between 280 and 2303 (median 1448) cross-sectional genotype-phenotype pairs, i.e., observations of (X ,Y ), obtained from the Stanford HIV Drug Resistance Database, restricted to subtype B sequences and to Phenosense or Antivirogram assays [52]. For each drug, we selected its resistance-associated mutations reported on the Stanford HIVdb website lumping together mutations occurring at the same site, or if unavailable, applied L 1 -penalized (lasso) linear regression [53,54] to select from all PR or RT mutations occurring at least ten times a sparse set E of n~10 predictor mutations. The performance of the models is reported as the Pearson correlation coefficient between true and predicted phenotypes, estimated from a separate, random subset of 20% of the data. Phenotypic cutoff values were derived from the distribution of fold-change values as described previously [15,26] and used to dichotomize resistance predictions (Supporting Table S2).

Individualized Genetic Barrier (IGB)
Given an I-CBN model, transition probabilities among genotypes g, h[G can be computed as Using these transition probabilities and the predicted drug resistance phenotypes m g , we define the IGB of genotype g[G to resistance to drug d as the probability of the virus not reaching any genotypic state predicted as resistant, where G d 5G is the subset of all genotypes g predicted to be resistant to drug d, i.e., for which m g is greater than the resistance cutoff (Supporting Table S2). Genotypes outside the lattice G (not complying with the partial order constraints) are regarded as erroneous observations of the genotypes in the lattice. The IGB of such a genotype f is where P(gDf ) is the probability of the actual genotype being g given that f has been observed. By Bayes' theorem, where P(f Dg)~(1{e) n{d(f ,g) e d(f ,g) is modeled as in Eq. 2.
The genetic barrier to escape from a regimen R is defined as the sum of the drug-specific barriers over all drugs in the regimen Because the IGB to each drug can be regarded as an estimate of the activity of the drug (the probability of not escaping), the IGB to a regimen may be interpreted as the expected number of active drugs in the regimen. Note that 0ƒIGB R (g)ƒDRD, that IGB R (g)&0 means that evolutionary escape is almost certain, and that adding a drug to a regimen can only increase the genetic barrier to the regimen.

Statistical analysis
For classifying therapies as failures versus successes, univariate, multivariate, and regularized multivariate logistic regression was used. For a set of precitors x 1 , . . . ,x m , the therapeutic success probability p is modeled by the regression where b j are the regression coefficients. The odds ratio of therapeutic success associated with a one-unit increase in predictor j is e b j . P-values for the predictors are corrected for multiple testing using the Benjamini-Hochberg procedure. For regularization, we applied the elastic net [55], which combines an L 1 (lasso) penalty encouraging sparse solutions with an L 2 (ridge) penalty that tends to average across correlated features. Classifier performance was evaluated using ROC curves and is reported as the area under the ROC curve (AUC). The data was ten times randomly split into 40% for estimation of the two hyperparameters (one for the degree of each type of regularization) and 60% for model fitting and testing, which was done by 10-fold cross-validation [56]. The R language for statistical computing (http://www.r-project. org/) was used for all analyses, including the R packages icbn, glmnet, and ROCR. An R script for computing the IGB is available at: http://www.cbg.ethz.ch/software/igb. The Stanford HIVDB Sierra web service was used for GSS computation.

Supporting Information
Figure S1 Treatment change episode (TCE). Each TCE consists of a failing therapy followed by a salvage therapy. The failing therapy gives rise to a failure, whereas the salvage therapy can be either a success or a failure, depending on whether viral load suppression below 50 cps/ml (400 cps/ml) was achieved during treatment or not (see Methods). Genotypes are measured prior to or at the beginning of the salvage regimen.  Figure S22 Univariate analysis of predictors of response to antiretroviral combination therapy in the SHCS database. Associations have been tested using logistic regression models and odds ratios of therapeutic success, defined as viral load reduction below 50cps/ml (A) and 400cps/ml (B), are reported together with their 95% confidence intervals on a logarithmic scale. Benjamini-Hochberg-corrected p-values are represented as black (pv0:001) and grey (pv0:01) symbols. Only predictors with a p-value smaller than 0.01 are included.

(EPS)
Table S1 Complete list of all variables analyzed with respect to treatment outcome. Groups NRTI, NNRTI, and PI consist of binary variables, one for each drug, indicating the presence of the respective drug in the regimen. For PIs, boosted (given together with low-dose RTV) and unboosted formulations are distinguished, except for LPV which is always applied boosted. The variable RTV refers to the use of ritonavir as the only PI in the regimen. Demographic and clinical variables include age and gender of the patient, whether he or she had AIDS, the maximum viral load and the minimum CD4 T cell count measured anytime before treatment onset, transmission group (BLOOD, HET, IDU, MSM, or OTHER), and adherence. Patient adherence was assessed in questionnaires and measured as the percentage of missed dosages [50,51] for 1183 (45%) of the patients, and then dichotomized. For the multivariate analysis only, unobserved values of patient adherence were imputed by a logistic regression model (one for each dataset) from all remaining variables except the response (treatment outcome). For each drug, the individualized genetic barrier (IGB) is the probability of the virus not escaping from the selective pressure of the drug. The IGB to regimen is defined as the sum of the drug-specific IGBs over all drugs in the regimen. Mutations in the PR and RT of HIV-1 are denoted by the sequence position followed by the amino acid. Each variable is binary indicating the presence of the respective amino acid at the respective position in the protein. Only mutations that occurred in at least 5% of the samples are considered. (PDF) Table S2 Construction of I-CBN models. For each drug, is reported the number N of genotype-phenotype pairs the model has been learned from, the correlation coefficient R between predicted and true drug resistance phenotypes, the list of selected mutations, and the cutoff value C defining resistant versus susceptible viruses. The correlation coefficient has been estimated from an independent test set consisting of 20% of the data that was not used for training. For ZDV, DDI, D4T, 3TC, ABC, TDF, FTC, EFV, NVP, SQV, IDV, NFV, LPV, and TPV, the corresponding drug resistance-associated mutations reported on the Stanford HIV Drug Resistance Database website were used, while for DDC, RTV, APV, and ATV, we selected ten mutations using L1-penalized linear regression (lasso). (PDF) Table S3 Different categories of drug combinations in SHCS databse. The first category includes drug combinations currently recommended as first-line or alternative regimens according to the JAMA recommendations [42]. Category 2 includes regimens that were recommended as first-line or second-line regimens in the past, regimens that are still in use in developing countries or are used sometimes if drug resistant virus is present at baseline, or salvage regimens. Category 3 includes older regimens that are not in use anymore as first-line regimens but were before, regimens that are not corresponding to guidelines, including those that are sometimes used in special circumstances, such as unusual tolerability, etc. To evaluate the prediction performance (sensitivity and specificity) of each category, leave-one-out cross-validation experiments were performed. (PDF) Table S4 PI usage and boosting fraction. Reported is the total number of regimens in the SHCS database that include the respective PI, and in parenthesis, the percentage that the PI is boosted, i.e., given together with low-dose ritonavir (RTV). (PDF) Table S5 Comparative performance in predicting treatment outcome, defined as a reduction of viral load below 50cps/ml, for different elastic net regularized logistic regression models. Comparative performance in predicting treatment outcome, defined as a reduction of viral load below 50cps/ml, for different elastic net regularized logistic regression models. In columns 3-8, the p-value of a two-sided Wilcoxon rank sum test for differences in the area under the ROC curve (AUC; column 2) is reported. Prediction models (column 1) are encoded by the sets of predictors used, where C refers to the demographic and clinical variables, D refers to drugs, and M to mutations. For example, the model IGB+CDM includes as predictors IGB to regimen, clinical and demographic predictors, applied drugs, and mutations. (PDF) Table S6 Comparative performance in predicting treatment outcome, defined as a reduction of viral load below 400cps/ml, for different elastic net regularized logistic regression models. Comparative performance in predicting treatment outcome, defined as a reduction of viral load below 400cps/ml, for different elastic net regularized logistic regression models. In columns 3-8, the p-value of a two-sided Wilcoxon rank sum test for differences in the area under the ROC curve (AUC; column 2) is reported. Prediction models (column 1) are encoded by the sets of predictors used, where C refers to the demographic and clinical variables, D refers to drugs, and M to mutations. For example, the model IGB+CDM includes as predictors IGB to regimen, clinical and demographic predictors, applied drugs, and mutations. (PDF)