Table 1.
Whole-blood microarray profiles used in this study for training, testing, and validating signatures, as shown in Fig 1. Sample counts are shown for each dataset stratified by HIV and TB status.
Fig 1.
Workflow and whole-blood microarray datasets used to train, evaluate (using leave-one-out cross-validation, LOOCV), and test predictive models. Each box represents a dataset, identified by GEO ID and first author name. Bullet points list the geographical sites and the TB and HIV status available for the samples making up each dataset. Sample counts for each site are also shown in Table 1.
Table 2.
Models trained in the manuscript are named in the form: <classes-predicted>.<algorithm>.<number-of-probes>, where classes-predicted is one of the options specified under Model Complexities and algorithm is one of the options specified under Model Algorithms. Thus, the model named six.rf.25 indicates a six-class multinomial random forest model based on 25 microarray probes. For comparison, two external models have been included, as indicated under External Models.
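The naming convention above can be illustrated with a short parsing sketch. This is not code from the study: the function `parse_model_name` and the `ModelName` tuple are hypothetical helpers, and the only algorithm labels checked are the six abbreviations listed in the Fig 2 caption.

```python
from typing import NamedTuple

# Algorithm abbreviations as listed in the Fig 2 caption.
KNOWN_ALGORITHMS = {"glmnet", "knn", "nnet", "rf", "svmRadial", "xgbTree"}


class ModelName(NamedTuple):
    """Hypothetical container for the three parts of a model name."""
    classes_predicted: str
    algorithm: str
    n_probes: int


def parse_model_name(name: str) -> ModelName:
    """Split '<classes-predicted>.<algorithm>.<number-of-probes>' into parts."""
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated fields, got {name!r}")
    classes_predicted, algorithm, n_probes = parts
    if algorithm not in KNOWN_ALGORITHMS:
        raise ValueError(f"unknown algorithm label {algorithm!r}")
    if not n_probes.isdigit():
        raise ValueError(f"probe count must be a whole number, got {n_probes!r}")
    return ModelName(classes_predicted, algorithm, int(n_probes))


example = parse_model_name("six.rf.25")
print(example.classes_predicted, example.algorithm, example.n_probes)
```

External models such as threeGene and ACS do not follow this three-field pattern, so the sketch would reject them by design.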
Fig 2.
Training cross-validation results on adult TB samples.
Leave-one-out cross-validation (LOOCV) areas under the receiver operating characteristic curve (AUCs) for models on the South Africa adult data. Each panel plots the AUC for six machine learning algorithms (glmnet: Elastic-Net logistic regression; knn: k-Nearest Neighbors; nnet: Neural Network; rf: Random Forest; svmRadial: Support Vector Machine with Radial Basis Function kernel; xgbTree: Extreme Gradient Boosting), starting with models trained using all 554 probes and iteratively shrunk to models trained on only 10 probes. Models were trained to classify the data into 6 (TB:HIV+, TB:HIV-, LTB:HIV+, LTB:HIV-, OD:HIV+, OD:HIV-), 4 (TB:HIV+, TB:HIV-, LTB:HIV+, LTB:HIV-) and 2 (TB, LTB) classes. Two types of 2-class models were trained, using either HIV+ or HIV- samples. Error bars show bootstrap-estimated 95% confidence intervals around the AUC.
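As a rough illustration of how an AUC and a bootstrap 95% confidence interval can be obtained from held-out predictions, the sketch below uses only the Python standard library. It is not the study's implementation: it assumes binary labels (1 = positive class, 0 = negative class), one held-out score per sample, and a simple percentile bootstrap, and the function names, resample count, and example values are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, for illustration)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo_idx = max(0, round(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]


# Made-up held-out scores, one per sample, as LOOCV would produce.
labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.90, 0.55]
print(auc(labels, scores), bootstrap_auc_ci(labels, scores))
```

With small sample sizes the percentile interval can be wide and is bounded by 0 and 1, which matches the general appearance of error bars around an AUC.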
Table 3.
Training cross-validation results.
Leave-one-out cross-validation (LOOCV) areas under the receiver operating characteristic curve (AUCs) and 95% confidence intervals (CIs) for the best 10-probe model for each combination of machine-learning algorithm and number of predicted classes, evaluated on the South Africa adult data (Fig 2).
Fig 3.
A six-class multinomial model optimally predicts 4 independent test sets.
ROC curves for active TB vs non-TB classification of independent test sets. Legends show the AUC for each model, with the 95% confidence intervals in parentheses. Models developed in this study are named in the form <number-of-classes>.<algorithm>.<number-of-probes>. For example, six.rf.10 is the 10-probe random forest model trained to predict 6 classes. twoneg and twopos refer to 2-class models trained on HIV- or HIV+ samples, respectively. threeGene refers to the signature described by Sweeney et al [23], and ACS refers to the signature described by Zak et al [25]. A ROC curves for classification of the Malawi test samples from the Kaforou cohort. B ROC curves for the Malawi test set plus the three further independent test sets described in Table 1.
Fig 4.
The six-class multinomial model identifies HIV+ TB as a distinct state.
A ROC curves for the 10-gene six-class multinomial model discriminating HIV+ active TB samples from HIV- active TB samples, and HIV+ active TB samples from HIV- active TB and HIV+/- LTB and HIV+/- OD samples in the Malawi test set. B Dot- and boxplots of expression levels of six-class multinomial model genes in the entire Kaforou dataset. C Six-class multinomial genes classified by their TB/HIV behavior as determined by fitting linear models to gene expression as a function of disease state. TB upregulated genes are indicated in orange and downregulated genes shown in blue.
Fig 5.
Dot- and boxplots for each microarray probe, named by its corresponding gene, whose expression is strongly correlated with that of LAG3 (Spearman correlation ρ>0.8), shown for latent and active TB samples from the Kaforou dataset.
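The selection criterion in this caption (Spearman ρ>0.8 against LAG3) can be sketched as follows. This is an illustrative, standard-library-only example rather than the study's code: the helper names are hypothetical, GENE_A and GENE_B are placeholder gene names, and all expression values are made up.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def probes_correlated_with(target, expression, threshold=0.8):
    """Return probes whose Spearman correlation with the target exceeds the threshold."""
    ref = expression[target]
    return [name for name, vals in expression.items()
            if name != target and spearman(ref, vals) > threshold]


# Made-up expression values across five samples; gene names other than LAG3 are placeholders.
expression = {
    "LAG3": [1.0, 2.0, 3.5, 5.0, 4.0],
    "GENE_A": [0.2, 0.9, 1.1, 3.0, 2.5],
    "GENE_B": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(probes_correlated_with("LAG3", expression))
```

Because Spearman correlation depends only on ranks, the filter is insensitive to monotone transformations of the expression scale, such as log transformation.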