Interpreting tree ensemble machine learning models with endoR

doi:10.1371/journal.pcbi.1010714

Fig 1.

Description of the endoR method workflow.

A: General overview of the workflow from data acquisition to the visualization of a network. endoR is applied to a trained classification or regression tree ensemble model. The model is first simplified into a decision ensemble, which is used to calculate the feature importance and influence on predictions. The resultant metrics are displayed in a summary plot listing the feature importance and influence, and as a decision network. The decision network illustrates the association between the response and single variables or pairs of variables, with regard to feature importance and influence. If the influence of a variable depends on other variables, this is visible in the network as edges between the corresponding nodes. B: Steps taken by endoR to generate a stable network. endoR accepts tree ensemble models that were made with the XGBoost, gbm, randomForest or ranger R-packages [43–46]. Regularization is optional and consists of simplifying decisions and the decision ensemble to reduce noise. The procedure can be repeated on B bootstraps to select stable decisions prior to constructing the final network.
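
The selection of stable decisions across B bootstraps can be pictured as a frequency filter. The following is a minimal Python sketch, not endoR's implementation: it assumes each bootstrap yields a set of decision identifiers and keeps a decision when it appears in at least a chosen fraction of bootstraps. The function name `select_stable_decisions`, the threshold `min_fraction`, and the example decision strings are hypothetical; endoR's own selection criterion (controlled by α, see Fig 3) is not reproduced here.

```python
from collections import Counter


def select_stable_decisions(bootstrap_decisions, min_fraction=0.7):
    """Keep decisions found in at least `min_fraction` of the bootstraps.

    `bootstrap_decisions` is a list with one collection of decision
    identifiers per bootstrap. `min_fraction` is a hypothetical cutoff
    used only for illustration.
    """
    n_boot = len(bootstrap_decisions)
    if n_boot == 0:
        return set()
    # Count each decision at most once per bootstrap.
    counts = Counter(d for decisions in bootstrap_decisions for d in set(decisions))
    return {d for d, c in counts.items() if c / n_boot >= min_fraction}


# Toy example with B = 3 bootstraps (decision strings are made up).
bootstraps = [
    {"Group=a & Alistipes_A>0", "V3>0.5"},
    {"Group=a & Alistipes_A>0"},
    {"Group=a & Alistipes_A>0", "V7<=0.1"},
]
stable = select_stable_decisions(bootstraps, min_fraction=0.7)
```

In this toy input only the decision present in all three bootstraps passes the cutoff; the two decisions seen once are discarded as unstable.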

Fig 2.

endoR captures interactions predictive of an artificial phenotype from a random forest fitted on real metagenomes.

A-E: Real metagenomes with an artificial phenotype (AP): samples were separated into 4 groups (labelled a-d), and a binary response variable (‘1’ = blue, ‘-1’ = yellow) was simulated so that it could be predicted from a set of decisions based on the ‘group’ categorical feature and specific, randomly chosen microbial abundance features (e.g., ‘Alistipes A’). Dashed grey lines denote thresholds in the predetermined decisions used to make the response variable and are described in Table 1 (e.g., the response variable is ‘1’ if samples belong to Group a and have non-null relative abundances of both Alistipes A and Marvinbryantia sp900066075). For samples in Group c, the response variable was built with an ‘OR’ rule (i.e., ‘Group = c & ((B. clarus >0 & Oscillibacter sp001916835 >0) | F. prausnitzii G >10⁻²)’), so the two sub-rules are shown separately in C and D. F: Ground truth network of features derived from the response variable generation procedure described in A. Pairs of variables predicting ‘1’ are linked by a blue edge (‘positive’) and those predicting ‘-1’ by a yellow edge (‘negative’). Variables have a blue node color (‘positive’) if high values are predictive of ‘1’ and a yellow node color (‘negative’) if high values are predictive of ‘-1’. If high values are predictive of ‘1’ or ‘-1’ depending on the values of other variables (e.g., Group b predicts ‘1’ if V3 takes high values, but ‘-1’ if V3 has low values), the color is grey (‘depends’). G-H: Variable importances from the RF model as measured by the mean decrease in Gini impurity and by endoR. Due to the feature selection step, the RF model was fitted on the 18 selected features shown on the y-axis; the feature importance of all other taxa can be considered null for both. The point color indicates which features were used to construct the response (‘True’) and which are taxonomically related to them (‘closely related’), with ‘closely related’ defined as the immediate parent or child taxonomic classification in the taxonomy hierarchy (e.g., the Bacteroides genus is the child of the Bacteroidaceae family, while Bacteroidaceae is the parent of Bacteroides). I: Full decision network extracted by endoR from a RF model trained on the dataset described in A. Only the 20 features with the highest feature importance are labelled. For I only, the edge transparency is inversely proportional to the importance. J: Same network as shown in I, but edges with the lowest interaction importance were removed to obtain paths between nodes of length ≤ 3. All features are labelled.
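
The Group c ‘OR’ rule quoted above can be written out as a small predicate. This is a sketch in Python for illustration only: the thresholds are taken from the caption, but the function and argument names are hypothetical, the assumption that the rule yields ‘1’ when satisfied and ‘-1’ otherwise is not stated in this caption (the full rules are in Table 1), and the noise added to the simulated response is left out.

```python
def group_c_rule(group, b_clarus, oscillibacter, f_prausnitzii_g):
    """Evaluate 'Group = c & ((B. clarus > 0 & Oscillibacter sp001916835 > 0)
    | F. prausnitzii G > 10^-2)' on relative abundances."""
    sub_rule_1 = b_clarus > 0 and oscillibacter > 0   # first sub-rule
    sub_rule_2 = f_prausnitzii_g > 1e-2               # second sub-rule
    return group == "c" and (sub_rule_1 or sub_rule_2)


def simulated_response(sample):
    # ASSUMPTION: the rule maps to '1' when met and '-1' otherwise.
    return 1 if group_c_rule(**sample) else -1


# Made-up samples, not data from the study.
samples = [
    {"group": "c", "b_clarus": 0.002, "oscillibacter": 0.001, "f_prausnitzii_g": 0.0},
    {"group": "c", "b_clarus": 0.0, "oscillibacter": 0.001, "f_prausnitzii_g": 0.05},
    {"group": "c", "b_clarus": 0.0, "oscillibacter": 0.0, "f_prausnitzii_g": 0.001},
    {"group": "a", "b_clarus": 0.002, "oscillibacter": 0.001, "f_prausnitzii_g": 0.05},
]
responses = [simulated_response(s) for s in samples]
```

The first sample satisfies only the first sub-rule and the second only the other, which is why a single rule with ‘OR’ produces two distinct patterns to recover.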

Table 1.

Predetermined decision rules based on the making of the artificial phenotypes.

Fig 3.

endoR’s performance is robust to hyperparameters and depends on the input model.

Simulation results based on 100 FSDs with n = 1000 observations (except when varied in E-F) and 50 APs using all observations (A-C). In all experiments, the noise was r = 0.05 (except when varied in B-C) and endoR was applied to fitted RFs with α = 5 (except when varied in A) and B = 10 (except when varied in D). For each dataset and parameter setting, we fitted a RF and applied endoR. Then, we computed the following three metrics: Cohen’s κ of the RF, weighted precision and recall values of the selected edges in the stable decision ensemble, and TP/FP-curves based on the probabilities of being selected in the stable decision ensemble (see Methods). A and D: TP/FP-curves averaged across all datasets for a fixed parameter setting (line) are displayed with their standard deviations (shaded area). The averages and standard deviations of the numbers of TPs and FPs expected for a randomization null model are shown in grey. Large points indicate the average number of TPs and FPs in the stable ensembles generated by endoR. B-C and E-F: Each point corresponds to the precision/recall of endoR applied to a single dataset and parameter setting. The larger traced points are the averages across all datasets for a fixed parameter setting. A: Increasing α increases both the TPs and FPs. Small values of α effectively control the FPs without strongly impacting the recovered TPs. D: Larger values of B are slightly better, but endoR performs well even for small values of B. B-C and E-F: As expected, decreasing the noise or increasing the number of observations improves the performance of endoR in terms of both precision and recall. Importantly, there is a strong dependence of endoR performance on the performance of the fitted RF. Moreover, endoR has good precision even for small sample sizes.
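
As a rough illustration of the edge-level metrics, the sketch below computes precision and recall of selected edges against a ground-truth edge set in Python. It is a simplification: the caption refers to weighted precision and recall as defined in the Methods, whereas this version is unweighted, and the function name and variable labels are hypothetical.

```python
def edge_precision_recall(selected_edges, true_edges):
    """Unweighted precision and recall of selected edges.

    Edges are treated as undirected, so ("V1", "V2") and ("V2", "V1")
    count as the same pair. The weighting used in the paper is NOT applied.
    """
    selected = {frozenset(e) for e in selected_edges}
    truth = {frozenset(e) for e in true_edges}
    tp = len(selected & truth)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Made-up example: 4 true edges, 3 selected edges, 2 of them correct.
true_edges = [("V1", "V2"), ("V3", "V4"), ("V5", "V6"), ("V7", "V8")]
selected_edges = [("V2", "V1"), ("V3", "V4"), ("V1", "V9")]
precision, recall = edge_precision_recall(selected_edges, true_edges)
```

Here two of the three selected edges are true (precision 2/3) and two of the four true edges are recovered (recall 1/2), which mirrors the trade-off the panels describe: a stricter selection tends to raise precision at the cost of recall.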

Fig 4.

endoR is better than or comparable to state-of-the-art methods at identifying true variables and pairs of variables predictive of artificial phenotypes.

Average (line) and standard deviation (area) of the number of true positives (TP) identified for a given number of false positives (FP). The average numbers of TP and FP in the endoR final decision ensemble are indicated with points. A, C: single variables; B, D: pairs of variables, across 50 replicates of artificial phenotypes. A, B: truncated curves of the absolute numbers of TP and FP are displayed; dashed grey lines denote the ground truth number of TP. C, D: the full curves of TP and FP rates are displayed. Lines are dashed when necessary due to overlaps. ‘Random’ signifies results expected with a randomization null model. A, C: All methods based on fitted predictive models almost perfectly ranked TP because of the feature selection step in model fitting. B, D: endoR better discriminated TP from FP edges than SHAP and lasso. Only endoR does not return all features and interactions, which limits the number of FPs in the final decision ensembles but also lowers recall.
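
A TP/FP curve of the kind plotted here can be built by ranking candidates by score and counting true and false positives cumulatively. The Python sketch below shows that generic construction for a single replicate; the function name, scores, and labels are hypothetical, and the averaging across replicates shown in the figure is not included.

```python
def tp_fp_curve(scores, true_items):
    """Cumulative (FP, TP) counts when items are ranked by decreasing score.

    `scores` maps each candidate (variable or pair of variables) to an
    importance score; `true_items` is the ground-truth set.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    fp, tp = 0, 0
    curve = [(fp, tp)]
    for item in ranked:
        if item in true_items:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve


# Made-up scores for four variables, two of which are truly predictive.
scores = {"V1": 0.9, "V2": 0.8, "V3": 0.4, "V4": 0.1}
curve = tp_fp_curve(scores, true_items={"V1", "V3"})
```

A method that ranks every true item first climbs straight up before moving right, whereas a random ranking follows the diagonal on average, which is the role of the ‘Random’ reference in the panels.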

Fig 5.

endoR recapitulates previous findings on differences in gut microbiomes between healthy individuals and patients diagnosed with cirrhosis.

A: Feature importance aggregated across each level of discretized variables and influence per-level as determined by endoR. Levels correspond to discrete variable categories, and here represent relative abundance groups created by endoR (i.e., whether samples had ‘Low’, ‘Medium’ or ‘High’ relative abundances of each taxon). ‘closely related’ designates taxa that are the direct parent or child taxonomic classification of a taxon originally associated with disease status in Qin et al. [41]. White boxes in the influence plot signify that the level was not used in any stable decision; thus, the influence could not be calculated. B: Decision network extracted from the stable decision ensembles. See Fig 2 for the description of the network; the boxed legend is shared for A and B.

Fig 6.

Relative abundances (RA) of Oscillospirales, Christensenellales and other select bacteria predict conditions favorable to colonization of the human gut by Methanobacteriaceae.

A: Feature importance and influence for each taxon used by the decision ensemble generated by endoR. Taxonomic levels are indicated with label prefixes: ‘f_’ = family, ‘g_’ = genus, and ‘s_’ = species, while taxonomic orders are indicated via bar and label colors. Levels correspond to ‘Low’ and ‘High’ relative abundances of taxa. B: Sum of gene copy numbers of marker genes involved in H₂ production and consumption (see Methods and S5 Table) for endoR-selected features. SRB: dsrA and dsrB genes exclusively involved in sulfate reduction [82]; Acetogen: fhs gene involved in acetogenesis [83]; other categories correspond to the predicted functions of hydrogenases as determined by the HydDB database: H₂ production (H₂-prod.), H₂ uptake (H₂-upt.), sensory [84]. Boxes are white for taxa in whose genomes the genes were not detected. The cross indicates ‘Non applicable’ (for the ‘dataset_nameLouisS_2016’ feature). C-D: Effect sizes from gene set enrichment analyses performed at the gene function level (C) or for each gene (D); bars are colored by the adjusted p-values (Adj. p). D: Bars are colored by gene function. The predicted O₂ tolerance of hydrogenases and the electron (e⁻) donor or acceptor are indicated by colored boxes on the right of the plot [84]. Asterisks denote significance (adjusted p-value < 0.05). E: Decision network in which nodes correspond to individual features and edges correspond to pairwise interactions. Node and edge colors describe the feature and interaction influence; node sizes and edge widths are proportional to their importances. Unconnected nodes with an importance ≥ 0.3 are also shown. Taxa with a gene copy number ≥ 30 for H₂ production and ≥ 20 for SRB genes are highlighted in yellow and green, respectively. The boxed legend applies to A, B, and E.

Table 2.

Predetermined decision rules to generate the response variable from the simulated datasets.
