Interpreting tree ensemble machine learning models with endoR
Fig 2
endoR captures interactions predictive of an artificial phenotype from a random forest fitted on real metagenomes.
A-E: Real metagenomes with an artificial phenotype (AP). Samples were separated into 4 groups (labelled a-d), and a binary response variable (‘1’ = blue, ‘-1’ = yellow) was simulated so that it could be predicted from a set of decisions based on the ‘group’ categorical feature and specific, randomly chosen microbial abundance features (e.g., ‘Alistipes A’). Dashed grey lines denote the thresholds in the predetermined decisions used to generate the response variable, which are described in Table 1 (e.g., the response variable is ‘1’ if samples belong to Group a and have non-null relative abundances of both Alistipes A and Marvinbryantia sp900066075). For samples in Group c, the response variable was built with an ‘OR’ rule (i.e., ‘Group = c & ((B. clarus > 0 & Oscillibacter sp001916835 > 0) | F. prausnitzii G > 10⁻²)’), so each of the two sub-rules is shown separately in C and D. F: Ground truth network of features derived from the response variable generation procedure described in A. Pairs of variables predicting ‘1’ are linked by a blue edge (‘positive’) and those predicting ‘-1’ by a yellow edge (‘negative’). Nodes are blue (‘positive’) for variables whose high values are predictive of ‘1’ and yellow (‘negative’) for variables whose high values are predictive of ‘-1’. If high values are predictive of ‘1’ or ‘-1’ depending on the values of other variables (e.g., Group b predicts ‘1’ if V3 takes high values, but ‘-1’ if V3 takes low values), the node is grey (‘depends’). G-H: Variable importances from the random forest (RF) model as measured by the mean decrease in Gini impurity and by endoR. Due to the feature selection step, the RF model was fitted on the 18 selected features shown on the y-axis; the feature importance of all other taxa can be considered null for both measures.
The point color indicates whether features were used to construct the response (‘True’) or are taxonomically related to such features (‘closely related’), where ‘closely related’ is defined as the immediate parent or child in the taxonomic hierarchy (e.g., the Bacteroides genus is the child of the Bacteroidaceae family, while Bacteroidaceae is the parent of Bacteroides). I: Full decision network extracted by endoR from an RF model trained on the dataset described in A. Only the 20 features with the highest feature importance are labelled. The edge transparency is inversely proportional to the importance in I only. J: Same network as shown in I, but edges with the lowest interaction importance were removed to obtain paths between nodes of length ≤ 3. All features are labelled.
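
For illustration, the two decision rules quoted in this legend (Group a and Group c) can be written as a small Python sketch. This is not the authors' simulation code: the function name `artificial_phenotype` and the dictionary-based sample format are hypothetical, the rules for Groups b and d are given only in Table 1 and are therefore left unassigned here, and it is an assumption that samples in Groups a and c that do not satisfy their rule receive ‘-1’.

```python
from typing import Mapping, Optional

# Threshold quoted in the legend for F. prausnitzii G (relative abundance).
F_PRAUSNITZII_THRESHOLD = 1e-2


def artificial_phenotype(group: str, abundance: Mapping[str, float]) -> Optional[int]:
    """Return 1 or -1 for Groups a and c, and None for groups whose rules
    are not quoted in the legend (hypothetical helper, for illustration)."""

    def present(taxon: str) -> bool:
        # "Non-null relative abundance" is read as strictly greater than 0.
        return abundance.get(taxon, 0.0) > 0

    if group == "a":
        rule = present("Alistipes A") and present("Marvinbryantia sp900066075")
    elif group == "c":
        # 'OR' rule: either sub-rule (panels C and D) is sufficient.
        sub_rule_pair = present("B. clarus") and present("Oscillibacter sp001916835")
        sub_rule_single = abundance.get("F. prausnitzii G", 0.0) > F_PRAUSNITZII_THRESHOLD
        rule = sub_rule_pair or sub_rule_single
    else:
        # Rules for Groups b and d are in Table 1 and are not reproduced here.
        return None

    # Assumption: samples that do not satisfy their group's rule are labelled -1.
    return 1 if rule else -1


if __name__ == "__main__":
    sample = {"B. clarus": 0.001, "Oscillibacter sp001916835": 0.002}
    print(artificial_phenotype("c", sample))
```

Writing the Group c rule as two named sub-rules mirrors the way the figure splits it across panels C and D.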