Feature selection and prediction of treatment failure in tuberculosis

Background Tuberculosis is a major cause of morbidity and mortality in the developing world. Drug resistance, which is predicted to rise in many countries worldwide, threatens tuberculosis treatment and control. Objective To identify features associated with treatment failure and to predict which patients are at highest risk of treatment failure. Methods On a multi-country dataset managed by the National Institute of Allergy and Infectious Diseases we applied various machine learning techniques to identify factors statistically associated with treatment failure and to predict treatment failure based on baseline demographic and clinical characteristics alone. Results The complete-case analysis database consisted of 587 patients (68% males) with a median (p25-p75) age of 40 (30–51) years. Treatment failure occurred in approximately one fourth of the patients. The features most associated with treatment failure were patterns of drug sensitivity, imaging findings, findings in the microscopy Ziehl-Nielsen stain, education status, and employment status. The most predictive model was forward stepwise selection (AUC: 0.74), although most models performed at or above AUC 0.7. A sensitivity analysis using the 643 original patients filling the missing values with multiple imputation showed similar predictive features and generally increased predictive performance. Conclusion Machine learning can help to identify patients at higher risk of treatment failure. Closer monitoring of these patients may decrease treatment failure rates and prevent emergence of antibiotic resistance. The use of inexpensive basic demographic and clinical features makes this approach attractive in low and middle-income countries.


Objective
To identify features associated with treatment failure and to predict which patients are at highest risk of treatment failure.

Methods
On a multi-country dataset managed by the National Institute of Allergy and Infectious Diseases we applied various machine learning techniques to identify factors statistically associated with treatment failure and to predict treatment failure based on baseline demographic and clinical characteristics alone.

Results
The complete-case analysis database consisted of 587 patients (68% males) with a median (p25-p75) age of 40 (30-51) years. Treatment failure occurred in approximately one fourth of the patients. The features most associated with treatment failure were patterns of drug sensitivity, imaging findings, findings in the microscopy Ziehl-Nielsen stain, education status, and employment status. The most predictive model was forward stepwise selection (AUC: 0.74), although most models performed at or above AUC 0.7. A sensitivity analysis using the 643 original patients filling the missing values with multiple imputation showed similar predictive features and generally increased predictive performance. PLOS  . There are many challenges for effective prevention, diagnosis, and treatment, with some of the most important being lack of funding, limited access to health resources (infrastructure, testing facilities, drug availability), stigmatization, poverty, and lack of compliance. Furthermore, these challenges are highly country dependent, explaining much of the regional variation. Over the last decade multidrug resistant (MDR) and extensively drug resistant (XDR) TB has become an important public and global health concern. Multidrug resistance is defined as resistance to at least rifampicin and isoniazid, while XDR refers to bacteria that are additionally resistant to any fluoroquinolone and at least one of three injectable second-line drugs (i.e. amikacin, kanamycin, or capreomycin) [2]. While the percentage of MDR/XDR is still relatively small globally, it varies significantly by region and country. In Belarus, the rate of MDR/XDR TB is estimated to be approx. 50/100,000 and has steadily been declining since 2010. Rates exceed 100/100,000 in Thailand and are increasing by approximately 4% annually [1]. Recent data by the WHO estimates the success rate of TB treatment to be approximately 83%, while the success rate for MDR TB is only 54% and even less (30%) for XDR. Promisingly, the introduction of shorter drug regimens has increased treatment success rates to approximately 90% In a recent publication, Sharma et al. modelled the proportion of TB patients likely to develop MDR/XDR and forecasted a steady increase for most countries over the next 25 years. Particularly alarming rates have been predicted for Russia, where MDR rates might exceed 30% by 2040 [3]. Long established clinical risk factors for treatment failure and subsequent development of MDR/XDR include the number of previous treatments, score on sputum smears, weight and imaging abnormalities (extensive lesions, cavities, mediastinal shift) [4]. More recently, research has focused on the prediction of resistance based on genomic sequencing data. Using whole genome sequencing, Walker et al. identified 120 mutations conferring treatment resistance and were able to predict 89.2% of the validation-set phenotypes with a mean sensitivity of 92.3% and a specificity of 98.4% [5]. Though more accurate than traditional sensitivity essays, it seems unlikely that this technique will be widely applied in low and middle-income countries due to limited resources.
For various conditions, demographics have been established as potent predictors of loss to follow-up [6] and various outcome measures [7]. Here, we aim to explore whether it is possible to predict treatment failure based on basic clinical and demographic information. Various machine learning models were developed to predict treatment failure without making prior assumptions to discover potential novel predictors of treatment failure.

Database
The National Institute of Allergy and Infectious Diseases (NIAID) Office of Cyber Infrastructure and Computational Biology (OCICB) established the TB Portals Program as a multinational collaboration for TB data sharing and analysis to advance TB research [8]. A consortium of clinicians and scientists from countries with a heavy burden of TB, especially drugresistant TB, work together with data scientists and IT professionals to collect multidomain TB data and make it available to the clinical and research communities. The majority of cases in the TB Portals is from drug-resistant TB. The TB Portals dataset includes clinical, imaging, genomic, and other metadata integrated in patient case records, which originate from countries around the world. This study analyzed data from Azerbaijan, Belarus, Moldova, Georgia, and Romania (Fig 1). The main descriptive features of the database are presented in Table 1.
In this study we used data from the TB portals (https://depot.tbportals.niaid.nih.gov), which has been stripped of all individual identifying information and has been assigned random, unique codes. While the database continues to be actively populated with participants, we used all available data as per March 2018. NIH safeguards the confidentiality and privacy of the individuals whose data have been deposited into the TB Portals. The primary outcome was treatment failure, which we defined as failure of therapy or death. The variables used for prediction were: country, age of onset, sex, education level, employment status, number of daily contacts, type of resistance,  body mass index (BMI), localization in the lung, number of X-rays, number of CT scans, dissemination (diffuse pulmonary nodules detected), size of the lung cavity, pleural involvement, imaging pattern, pneumothorax, pleuritis, nodal calcinosis, process extension, decrease in lung capacity, lung caverna, culture results, microscopy results, social risk factors (including smoking, alcoholism, ex-prisoner, etc.), and drug regimen. Imaging features were based mostly on chest CT scans, except for Georgia, where features were based mostly on chest X-rays.

Dataset division
We randomly split the original dataset into 70% of patients for a training subset and 30% for a testing or validation subset. The clinical characteristics of the training and testing subsets were similar (S1 Table). Predictive models were developed using only the training subset data and, only after all predictive models were developed, were they tested on the testing data subset. The performance of the models was optimized with cross-validation in the training subset before implementing them in the testing subset. This approach ensures that the performance of the algorithms in the testing subset is similar to their performance in other data which the models have never been exposed to.
We performed a complete case analysis with only the cases that had all variables available. As a sensitivity analysis, we performed the same analysis with all cases after imputing missing values with multiple imputation using predictive mean matching. The limits between statistics and machine learning are difficult to delineate [9]. For the purpose of this study we considered machine learning techniques as those that developed models based only on the data available without any human intervention. We used several of the most commonly used techniques for machine learning: stepwise forward selection, stepwise backward elimination, backward elimination and forward selection, Least Absolute Shrinkage and Selection Operator (LASSO) regression, random forests, support vector machine (SVM) with linear kernel and polynomial kernel [10,11]. The stepwise selection or elimination models select variables based on a greedy algorithm that tries to minimize the Akaike Information Criterion (AIC) and, therefore, balances predictive power with model complexity [10,11]. LASSO regression also penalizes model complexity but tends to shrink less important variables towards zero, so that the resulting model is sparse and more interpretable. Random forests fits several de-correlated decision trees to the data and "averages" them so that the resulting model has as little bias and as little variance as possible [10,11]. With de-correlation only a random subset of all available variables is used for each given branch, ensuring that the individual trees are heterogeneous and capture as much variability in the data as possible. As observations might not be separable in their original dimensional space, support vector machines separate the observations in a higher dimensional space that they visit through a kernel [10,11]. Regarding parameter tuning, forward stepwise selection, backwards stepwise elimination, and backwards elimination with forward selection did not require parameter tuning and just minimized the AIC. In LASSO regression, the lambda value that shrinks small coefficients to zero was chosen by cross-validation in the training set [10,11]. For random forests, we selected 1000 trees, a commonly used default, and the default leaf size of 5 in the randomForest function in R. We selected the number of variables randomly sampled as candidates at each split (mtry parameter in the randomForest function in R) by cross-validation in the training set. In SVM, the cost parameter quantifies the cost of a violation of the margin: when the cost is small the margins will be wide and many support vectors will be on the margin or will violate the margin, in contrast, when the cost is large the margins will be narrow and there will be few support vectors on the margin or violating the margin [11]. The cost was chosen by cross-validation in the training set. The gamma parameter for non-linear kernels in SVM is the free parameter of the Gaussian radial basis function. Large gamma leads to high bias and low variance models and vice versa. We used the default value of 1/data dimension in the SVM function in R for the gamma parameter. In summary, we used defaults for the most established parameters and tuned the rest of the parameters with cross-validation in the training set. Predictive performance was measured with the area under the receiving operating characteristic curve (AUC) in the testing model. The thresholds for prediction in an AUC are domain-specific and depend on the trade-off between the consequences of false positives and false negatives. We calculated the statistically "optimal" cut point in the probability of treatment failure: the Youden index, which maximizes sensitivity-(1 -specificity) [12]. Based on these thresholds, in order to give an approximate idea of clinical applicability, we provide sensitivities, specificities, positive predictive values (PPV), and negative predictive values (NPV) in our cohort. The full code for performing this study with full results is available on github (www.github.com/dsasson48/ niaid_TB).

Demographic and clinical features
The complete-case analysis database consisted of 587 patients (68% males) (411 for training and 176 for testing) with a median (p25-p75) age of 40 (30-51) years. Belarus contributed a major proportion of cases. Treatment failure occurred in approximately one fourth of the cases. The demographic and clinical features are summarized in Table 1.

Feature selection
Features that were statistically significantly most associated with the outcome of treatment failure are summarized in Table 2. The patterns of drug sensitivity, imaging findings, drug regimen, education and employment were associated with the outcome of treatment failure. As an illustrative example, the importance of variables for the random forest model is shown in Fig 2.

Prediction performance
We compared the prediction performance of the different machine learning models in their ability to predict treatment failure. The most predictive model was forward stepwise selection, Table 2. Main factors correlated with treatment failure. LASSO: Least absolute shrinkage and selection operator.

Model
Top factors correlated with treatment failure Forward stepwise selection Negatively correlated: drug sensitivity (sensitive), employment status (employed), microscopy: 1 to 99 acid-resistant bacteria in 100 fields of view when stained by Ziehl-Nielsen, dissemination (diffuse pulmonary nodules detected)

Backwards elimination and forward stepwise selection
Negatively correlated: employment status (employed), drug sensitivity (sensitive), dissemination (diffuse pulmonary nodules detected), microscopy: 0 acid-resistant bacteria in 100 fields of view when stained by Ziehl-Nielsen

LASSO
Positively correlated: country (Georgia), employment status (unemployed), extrapulmonary localization, lung cavity size (more than 25mm), decrease in lung capacity, microscopy: more than 99 acid-resistant bacteria in 100 fields of view when stained by Ziehl-Nielsen Negatively correlated: country (Romania), employment status (employed), employment status (student), drug sensitivity (sensitive), pattern of chest imaging (not reported)

Random forests (see Fig 2)
Unknown whether positively or negatively correlated: By mean decrease in accuracy: lung cavity size, type of resistance, employment status, country, total caverna By mean decrease in Gini index: age of onset, drug regimen, lung cavity size, number of daily contacts https://doi.org/10.1371/journal.pone.0207491.t002 although most models performed at or above AUC 0.7 (Table 3, Fig 3). If predictive models are used as a screening tool for identifying subjects that may benefit from more detailed evaluation of resistant TB, forward stepwise selection was the best model with a sensitivity of 0.36 and a NPV of 0.81. If predictive models are used to start treatment in resistant TB, the best model was LASSO with a specificity of 0.96 and a PPV of 0.64. Depending on the intended use of predictive models, modifications of the thresholds may provide higher sensitivities and specificities that meet the intended use of the model.

Fig 2. Variable importance in random forests considering mean decrease in accuracy (left) or mean decrease in Gini index (right).
In random forest models, node heterogeneity can be measured as a decrease in classification accuracy over all out-of-bag validated predictions when a variable is permuted after training and before prediction (mean decrease accuracy) or it can be measured as a decreased in node impurity (mean decreased Gini). The term "decrease" in these metrics does not imply any direction of the correlation. https://doi.org/10.1371/journal.pone.0207491.g002

Sensitivity analysis
To evaluate the robustness of our findings, we performed the same analyses in the dataset considering all original patients and imputing missing values with multiple imputation. This database of 643 patients (450 patients in the training set and 193 patients in the testing set) showed similar demographic and clinical features than the complete case analysis dataset. The factors consistently associated with treatment failure persisted to be the patterns of drug sensitivity, imaging findings, drug regimen, education and employment and some other factors like culture findings, or number of daily contacts. The predictive performance in the imputed dataset was generally superior than in the complete case analysis dataset, suggesting that larger datasets may improve prediction power (S2 Table and S1 Fig).

Discussion
This study identified patterns of drug sensitivity, imaging findings, findings in the microscopy Ziehl-Nielsen stain, education status, and employment status as the top factors correlated with treatment failure. In addition, our models based on basic demographic and clinical features predicted treatment failure with an AUC at or above 0.7, with the best-performing model having an AUC of 0.74. The first to use clinical and demographic data for the prediction of tuberculosis treatment were Keane et al. in 1997, who compared 130 Vietnamese subjects failing to respond to 673 responders. Hereby, they identified X-ray signs and degree of sputum smear positivity as the strongest predictors of treatment failure (70% sensitivity, 80% specificity) [4]. In 2002, studying 676 patients from Southern India, Santha et al. identified irregular treatment, being male, alcoholism and history of previous therapy to be associated with treatment failure [13]. More recently, Huang and colleagues reported extensive lesions, cavities and mediastinal shift to have the strongest association with treatment failure in their cohort [14]. Furthermore, they established age above 60 years and a history of smoking to be predictive. Concurrently, a case series of 167 TB patients in Iran identified coinfection with other pathogens, male sex, family history of TB, Turkoman ethnicity, and household size to be associated with treatment failure [15]. Of note, most studies reported different factors to have the strongest predictive performance. These differences are likely due to considerable heterogeneity in the underlying databases, regional population demographics and divergent analysis approaches. Furthermore, some studies were relatively small, thus allowing factors to become strong predictors by chance.
The first to employ a logistic regression model for prediction were Kalhori et al. in 2008, which, using clinical data, achieved an AUC of 0.70 [16]. Similarly, Rodrigo et al. report a predictive scoring instrument for tuberculosis lost to follow-up with an AUC of 0.67 (95%CI 0.65-0.70) [17]. Recently, Cherkaoui et al. conducted a prospective study employing questionnaires in rural Morocco. In the post-hoc analysis, the authors developed a scoring system with a high AUC (0.93, 95%CI 0.90-0.96), which was however not been validated in either a validation arm or external dataset and is therefore likely to be an overestimate of the true predictive power [18]. Though our dataset is relatively small, our unsupervised machine-learning approach still yielded a predictive model that outperformed previous ones.
Apart from emphasizing the importance of several well-established microbiological and imaging factors, our results add to the existing literature by identifying demographic and clinical factors with a strong and consistent statistical association with treatment failure or death, such as level of education and employment status. These results come from the application of machine learning techniques on data to discover associations that might have been overlooked by hypothesis-driven approaches. At the same time, both level of education and employment status are appealing predictors to treatment response, since they have been shown to affect health in various ways, including but not limited to health care seeking, treatment initiation, coherence and response in various other diseases [19][20][21][22]. Furthermore, the efficiency of TB treatment provision depends on national policies and health care systems, thus explaining why country is a strong predictor in our model. Number of daily contacts is strongly correlated with health care seeking behavior and treatment complications, all of which have been shown to predict outcome [23].
The identification and validation within a predictive model allows for easy and cheap identification of patients at high risk for treatment failure and can be used as a starting point for a patient-centered approach to resistance prevention.

Machine learning
Learning from data is typically used in situations with no theoretical or prior knowledge solution, but where data can be used to construct an empirical solution [10]. Currently most regression models in medicine consist of multivariable linear or logistic regression models constructed based on biological plausibility and prior medical knowledge. These models emphasize causality and have the major advantage of simplicity and interpretability: it is clear from the model which variables influence the outcome and their relative weights. In contrast, the more complex the machine learning models, the less intuitive interpretation of results [9]. In this study, we show that machine learning algorithms provided additional information as compared to explanatory models in prior literature based on prior medical knowledge. Furthermore, it had a higher AUC than any previous studies relying on clinical or demographic data.
Recently, many specialties including public health, are experiencing an increase in multicenter collaborations that yield datasets with large numbers of variables collected on numerous patients. The rapid increase in the number and quality of large datasets makes it possible to implement big data approaches to data analysis in the field of public health and TB.
Recently, a wide and deep neural network used targeted or whole genome sequencing and conventional drug resistance phenotyping data from 3,601 M. tuberculosis strains, 1,228 of which were multidrug resistant, to predict phenotypic drug resistance to 10 anti-tubercular drugs [24]. On an independent validation set, this wide and deep neural network showed significant performance gains over baseline models, with average sensitivities and specificities, respectively, of 84.5% and 93.6% for first-line drugs and 64.0% and 95.7% for second-line drugs [24].
In this publication, we relied on clinical data rather than genomic data, since gene sequencing is expensive and not widely available in low and middle-income countries. Contrarily, clinical data is easily accessible and can be used without incurring additional costs.

Strengths and weaknesses
The present study demonstrates that machine learning techniques add to explanatory models by identifying variables that may be correlated with TB treatment failure, thus informing better predictive models. All machine learning techniques, which have no subject-matter knowledge, identified relevant variables and showed a consistently high predictive performance.
The advantage of our model compared to pharmacological [25], microbiological [26] or genetic approaches [5,27] is that the required data is already been collected routinely, thus being widely available and cheap, which is a major advantage in resource-and budget-strained environments.
The clinical and public health contribution of this study are to provide low-and middleincome countries with inexpensive and easy to apply tools to identify populations where the yield of identification and treatment of resistant TB may be highest. Different stakeholders may build on our analyses, for which we provide the full code online.
Importantly, we do not intend to imply that factors correlated with resistant TB such as level of education, or level of employment are pathophysiologically causal in resistance to antituberculotic medications. Drug resistance is ultimately dependent on factors of the causal germ: Mycobacterium tuberculosis. A detailed microbiological and genomic characterization of the strain of Mycobacterium tuberculosis from each patient may predict resistant TB better [24], but is frequently not be feasible in low-resource setting. The orientation of this manuscript therefore was to find demographic and clinical features that are inexpensive to collect and can predict TB treatment failure. These factors may help identify inexpensively populations where the yield of detecting and treating resistant TB may be higher. The purpose of this manuscript was not to improve the performance of well-established algorithms in machine learning, but to show that different machine learning algorithms may be used to predict resistant TB using widely available data and off-the-shelf machine learning algorithms.
Missing data may have influenced the predictive capacity of our models. The sensitivity analysis imputing missing data showed an increased performance compared to the complete case analysis, although these results may also be interpreted as the higher predictive performance of a larger dataset. Furthermore, combining treatment failure and death might have biased the results towards the null since some cases might have died independent of tuberculosis (e.g. from traumatic causes). Another limitation is that the underlying data is derived from only a few countries in Eastern Europe.
Future studies in larger datasets should therefore validate our findings and prediction models for other countries. The current dataset was collected with a focus on resistant TB. Therefore, patients with resistant TB cases are oversampled in this dataset and generalizability might therefore be limited to countries with high resistance rates. In the future, addition of more variables and cases to the NIAID tuberculosis dataset might further improve predictive performance. Additional studies could focus on translation of these findings into a validated (online) prediction model that can be used by clinicians to identify patients at highest risk of treatment failure in routine practice.

Conclusions
In a mid-size dataset we found that machine learning techniques added information to explanatory models for prediction of resistance to TB treatment. The novel identification of demographic factors and validation of a prediction tool for TB treatment failure provides a proof-ofconcept that use of routinely collected data can help to improve routine care. This approach can potentially be more cost-effective in resource-constrained environments.
Supporting information S1 Table. Comparison of patient characteristics in the training and testing subset. IQR: interquartile range. (DOCX) S2 Table. Comparison of prediction performance of the different models in the imputed dataset. Predictive performance is higher than in the complete cases analysis (Table 3)