Fig 1.
An end–to–end gene signature model development scheme.
(A) The network–based meta–analysis approach. (A1) Schematic of differential gene expression calculations for a single cohort given a disease condition. (A2) Steps for integrating differential gene expression results for multiple cohorts into a gene covariation network. (A3) Four different networks were constructed corresponding to four different disease conditions: ATB v HC, ATB v LTBI, ATB v OLD, ATB v Tx. From these four networks, the top weighted nodes across networks were selected to form the common ATB–specific gene signature. (B) ML predictive model training and testing based on the common gene signature. (C) Steps for the predictive model validation.
Table 1.
Cohort lists used for model training and validation.
TB–related datasets used for training and validation analyses, along with accession IDs, platforms, and the disease comparisons in which each dataset was used (see S1 Table for full details, including which datasets were used in each network and details of the respiratory viral infection datasets used for model validation). Abbreviations: Net/ML = Network model & ML training/testing; Val = ML validation.
Fig 2.
Network analysis results and gene signature.
(A–D) Network degree distribution fitting. The distribution of weighted degrees for each network was fit with a probability density function to help determine top–ranked genes to retain for the candidate gene set (Methods). (E) Overlap of the top 5% of genes by weighted degree in each network. (F) Gene signature mapped onto a protein–protein interaction network: the protein–protein association network of the 45 candidate genes, constructed with STRING. Edge width corresponds to edge confidence. Associations correspond to physical interactions or represent proteins that act functionally together. (G) A visualization of S2 Table, showing which pathways (and corresponding genes) were enriched for our 45–gene set (red: gene is in pathway, blue: gene is not part of pathway). (H–K) Volcano plots displaying the mean log2(Fold Change) of gene expression across cohorts against the weighted degree for all genes within each network. The 45 candidate genes are highlighted in orange within each plot. (L–O) Heatmaps of log2(Fold Change) for each of the 45 candidate genes (rows) between disease conditions, displayed for each cohort (columns) used in constructing each network. Columns are labeled with the corresponding GSE ID of each cohort. If a cohort contains multiple clinically defined populations, the specific comparison is also indicated on the column. Bolded gene names are genes that are also included in the reduced model; genes with an asterisk are not included in either the full or reduced model.
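The overlap in panel (E) can be illustrated with a minimal sketch. It assumes a plain rank cutoff on weighted degree and a "present in at least k networks" rule; the actual selection procedure is the one described in Methods, and the function names, gene labels, and toy values below are hypothetical.

```python
from math import ceil

def top_fraction(weighted_degree, fraction=0.05):
    """Return the genes ranked in the top `fraction` by weighted degree."""
    ranked = sorted(weighted_degree, key=weighted_degree.get, reverse=True)
    k = max(1, ceil(len(ranked) * fraction))
    return set(ranked[:k])

def genes_in_at_least(networks, min_networks, fraction=0.05):
    """Genes that are top-ranked in at least `min_networks` networks."""
    counts = {}
    for degrees in networks.values():
        for gene in top_fraction(degrees, fraction):
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, c in counts.items() if c >= min_networks}

# Toy weighted degrees (made-up values, two of the four comparisons).
networks = {
    "ATB v HC":   {"G1": 9.0, "G2": 7.0, "G3": 2.0, "G4": 1.0},
    "ATB v LTBI": {"G1": 8.0, "G3": 6.0, "G2": 1.0, "G4": 0.5},
}
# A 50% cutoff is used here only because the toy networks are tiny.
shared = genes_in_at_least(networks, min_networks=2, fraction=0.5)
print(shared)
```

With real data, `fraction` would be 0.05 to match the panel, and each dictionary would hold one entry per gene in the network.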
Fig 3.
Systematic model validation using TB progression cohorts.
(A) The distributions of TB scores generated by the full model, stratified by categorical interval to disease (datapoints n = 1281) and (B) by clinically defined TB states (datapoints n = 137), are shown in violin plots. Each individual datapoint is plotted as a point in the violin plot. P–values were calculated using the Mann–Whitney U test with Bonferroni correction (ns, p < 1; *, p < 0.05; **, p < 0.01; ***, p < 0.001; ****, p < 0.0001). (C–D) Receiver operating characteristic curves depict the diagnostic performance of the model for incipient TB, (C) stratified by inclusive time intervals to disease (< 3, < 6, < 12, < 18, < 24, < 30 months) and (D) by mutually exclusive time intervals to disease (0–3, 3–6, 6–12, 12–18, 18–24, 24–30 months). (E) The distributions of TB scores generated by the full model, stratified by TB disease stage (HC, LTBI, and ATB), other lung disease (OLD), and viral infection (VI), are shown in the violin plot. (F) Area under the curve and 95% confidence intervals for each interval to disease are also shown. The diagnostic performance of the model in differentiating ATB from HC, LTBI, OLD, and/or viral infection (VI), using a pooled dataset of all 57 collected cohort studies (S1 Table) (datapoints n = 6290), is shown in ROC curves.
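The significance labels above can be reproduced with a short sketch. It assumes the standard Bonferroni adjustment (each raw p-value multiplied by the number of comparisons and capped at 1) and the star thresholds listed in the legend; the function names and the raw p-values are hypothetical, and the Mann–Whitney U test itself is not computed here.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def star_label(p):
    """Map an adjusted p-value to the legend's significance label."""
    for threshold, label in ((0.0001, "****"), (0.001, "***"),
                             (0.01, "**"), (0.05, "*")):
        if p < threshold:
            return label
    return "ns"

# Made-up raw p-values for four pairwise comparisons.
raw = [0.00001, 0.004, 0.03, 0.4]
adjusted = bonferroni(raw)
labels = [star_label(p) for p in adjusted]
print(list(zip(adjusted, labels)))
```

Note that a raw p-value of 0.03 is no longer significant after adjusting for four comparisons, which is why the corrected labels can differ from uncorrected ones.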
Table 2.
Prognostic performance of models stratified by inclusive time intervals.
Prognostic performance (AUROC with 95% confidence intervals) of the models developed in this report and published previously on the combined validation datasets for identification of incipient TB within a 2.5–year period, stratified by inclusive time interval to disease.
Table 3.
Prognostic performance of models stratified by exclusive time intervals.
Prognostic performance (AUROC with 95% confidence intervals) of the models developed in this report and published previously on the combined validation datasets for identification of incipient TB within a 2.5–year period, stratified by mutually exclusive time interval to disease.
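The difference between the inclusive intervals of Table 2 and the mutually exclusive intervals of Table 3 can be sketched as follows. This is an illustration only: it assumes half-open bins (lower bound included, upper bound excluded), computes AUROC by direct pairwise comparison, and omits confidence intervals. The function names and toy scores are hypothetical and are not taken from the study.

```python
def auroc(pos, neg):
    """AUROC as the probability that a positive outscores a negative
    (ties count as one half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_auroc(progressors, non_progressors, bounds, exclusive):
    """AUROC per time-to-disease stratum.

    progressors: list of (months_to_disease, score) pairs.
    exclusive=False gives inclusive strata (< 3, < 6, ...);
    exclusive=True gives mutually exclusive strata (0-3, 3-6, ...).
    """
    results = {}
    lower = 0
    for upper in bounds:
        start = lower if exclusive else 0
        pos = [s for m, s in progressors if start <= m < upper]
        if pos:
            results[(start, upper)] = auroc(pos, non_progressors)
        lower = upper
    return results

# Toy data (made-up): scores tend to fall as time to disease grows.
progressors = [(1, 0.9), (2, 0.8), (5, 0.6), (10, 0.3)]
non_progressors = [0.1, 0.2, 0.4, 0.5]
inclusive = stratified_auroc(progressors, non_progressors, [3, 6, 12], exclusive=False)
exclusive = stratified_auroc(progressors, non_progressors, [3, 6, 12], exclusive=True)
print(inclusive)
print(exclusive)
```

In the toy data the inclusive "< 12 months" stratum still scores well because it pools the early, high-scoring progressors, while the exclusive "6–12 months" stratum isolates the harder late cases.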
Fig 4.
Building a probabilistic TB risk model.
(A) The distributions of TB scores generated by the full model, stratified by categorical interval to disease and active TB, are shown in the violin plot, based on the pooled dataset collected from all 37 studies. Individual datapoints are plotted as points in the violin plot (datapoints n = 3869). (B) The six probability functions (cumulative distribution functions) that describe the probabilities of the outcomes (HC, LTBI, < 24m, < 12m, < 3m to disease, and active TB) as a function of TB score are shown as lines in different colors. The shaded areas represent the 95% confidence intervals.
Fig 5.
Model validation using the Catalysis TB treatment study.
(A) TB scores generated by the reduced model at Day 0, 7, 28, and 168 after treatment initiation and from healthy controls are compared in a violin plot. Individual datapoints are plotted as points in the violin plot (datapoints n = 395). (B) Receiver operating characteristic curves depict model discrimination between healthy controls and patients at different time points after treatment initiation. Area under the curve and 95% confidence intervals for each interval to disease are also shown. (C) At each timepoint, the TB scores were stratified by the time of sputum culture conversion to negative (negativity at day 28, 56, 84, and 168, and no conversion at day 168 [failed]) and plotted in the violin plot (datapoints n = 360). (D) ROC curves, stratified by timepoint after treatment initiation, depict the predictive performance of the models for discrimination between patients with bacteriological cure and those with treatment failure at the end of treatment (EOT). (E) Comparison of TB scores against treatment outcome for poorly compliant subjects at day 0. P–values were calculated using the Mann–Whitney U test (**, p < 0.01).
Fig 6.
Model validation using the Leicester TB treatment study.
(A) The distributions of TB scores generated by the reduced model, stratified by time interval after treatment initiation and healthy control, are shown in the violin plot. Individual datapoints are plotted as points in the violin plot (datapoints n = 728). (B) Receiver operating characteristic curves depict model discrimination between healthy controls and patients at different time intervals after treatment initiation. Area under the curve and 95% confidence intervals for each interval to disease are also shown. Comparisons of TB scores between smear–positive and smear–negative patients at TB diagnosis (C), and between patients requiring standard and extended anti–TB treatment (ATT) at week 3–4 (D), month 2–3 (E), and month 4–6 (F) after treatment initiation, are shown in the violin plots. P–values were calculated using the Mann–Whitney U test (ns, p < 1; *, p < 0.05; **, p < 0.01; ***, p < 0.001; ****, p < 0.0001). (G) Scatterplots show TB scores throughout the treatment course, color–coded by whether patients required standard or extended ATT. The line represents the median of the stratified group, with the 95% confidence interval around the median shown as the shaded area. (H) ROC curves, stratified by timepoint after treatment initiation, depict the predictive performance of the models for discrimination between patients requiring standard and extended ATT.