Hollow-tree super: A directional and scalable approach for feature importance in boosted tree models

doi:10.1371/journal.pone.0258658

Fig 1.

Single binarized decision tree created in sklearn to classify the Iris dataset.

(a) A single binarized decision tree was created with sklearn’s DecisionTreeClassifier to delineate the Iris dataset. The depth-four tree used Iris versicolor as the negative class (= 0) and Iris virginica as the positive class (= 1). Iris setosa was removed so that the dataset could be binarized. (b) We calculated the Gini importance for our decision tree using sklearn’s “feature_importances_” property. This revealed ‘petal length’ to be the most important feature in the model for classification into the positive or negative class. (c) When performing permutation importance on the four features included in our simple decision tree, we found that permuting the values of ‘petal width’ had the greatest impact on model prediction error when classifying data into the positive or negative class. Unlike Gini importance, this analysis revealed that ‘petal length’ was also significantly important in the classification process, whilst sepal length and width were again found to be relatively unimportant: permuting these features did not greatly impact model prediction error. (d) A one-feature partial dependence plot (PDP) for ‘petal length’ revealed that positive classification as Iris virginica (partial dependence) was non-linearly related to ‘petal length’, with a critical point at 5 cm marking certain negative (Iris versicolor) classification. (e) By introducing ‘petal width’ and conducting a two-feature PDP, we were able to determine that these two features were increasingly dependent for positive classification at values lower than 5 cm and 1.6 cm for petal length and width respectively, giving us inferences about both direction and magnitude between two features. (For two-feature PDPs, a colour map is added to help visualise dependencies such that green indicates a greater partial dependency than purple.)
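
The four analyses in this caption can be approximated with scikit-learn alone. The following is a minimal sketch, not the authors' code: the random seed, the number of permutation repeats, and the choice to evaluate on the training data are all assumptions made for illustration, so the resulting values need not match the figure.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import partial_dependence, permutation_importance
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
feature_names = iris.feature_names

# Binarize: drop Iris setosa (label 0), then map
# versicolor (label 1) -> 0 and virginica (label 2) -> 1.
mask = iris.target != 0
X = iris.data[mask]
y = (iris.target[mask] == 2).astype(int)

# (a) Depth-four decision tree (random_state is an assumption).
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# (b) Gini importance, read from the fitted tree.
gini = dict(zip(feature_names, clf.feature_importances_))

# (c) Permutation importance: mean drop in accuracy when one
# feature column is shuffled (n_repeats is an assumption).
perm = permutation_importance(clf, X, y, n_repeats=30, random_state=0)
perm_mean = dict(zip(feature_names, perm.importances_mean))

# (d) One-feature partial dependence for petal length (column 2).
pd_one = partial_dependence(clf, X, features=[2], kind="average")

# (e) Two-feature partial dependence for petal length and petal width
# (columns 2 and 3), evaluated on a 2-D grid.
pd_two = partial_dependence(clf, X, features=[2, 3], kind="average")

for name in feature_names:
    print(f"{name}: gini={gini[name]:.3f}, permutation={perm_mean[name]:.3f}")
```

Here `pd_one["average"]` holds the averaged positive-class probability along the petal length grid, and `pd_two["average"]` holds the same quantity over the petal length by petal width grid; either can be passed to a plotting library to reproduce plots in the style of panels (d) and (e).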


Fig 2.

Single decision tree linearized using the eli5 package.

By employing a general linear equation to define the relative contribution of each feature as decisions are made at increasing depth in the tree, it is possible to derive feature importance values which lead to a positive (Iris virginica) or negative (Iris versicolor) classification. This analysis offers an advantage over Gini importance, partial dependence, and permutation importance in that the feature importance coefficients provide both a direction and a magnitude for each feature in delineating data into the positive and negative classes.
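
One common way to linearize a tree prediction is to write it as a bias term (the positive-class fraction at the root) plus one signed contribution per feature, where each split along the decision path credits its feature with the change in positive-class fraction from parent to child. The sketch below implements that idea directly on scikit-learn's fitted tree structure. It is an illustration of the general technique under that assumption, not the eli5 package's implementation, and `path_contributions` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def path_contributions(clf, X):
    """Hypothetical helper: return (bias, contributions) so that
    bias + contributions[i].sum() equals the tree's P(class 1) for row i."""
    tree = clf.tree_
    # Positive-class fraction at every node (normalizing makes this
    # independent of whether tree_.value stores counts or fractions).
    values = tree.value[:, 0, :]
    p_pos = values[:, 1] / values.sum(axis=1)
    bias = p_pos[0]

    contributions = np.zeros((X.shape[0], X.shape[1]))
    paths = clf.decision_path(X)  # sparse matrix: samples x nodes
    for i in range(X.shape[0]):
        # Node ids visited by sample i, ordered from root to leaf.
        nodes = paths.indices[paths.indptr[i]:paths.indptr[i + 1]]
        for parent, child in zip(nodes[:-1], nodes[1:]):
            # Credit the split feature with the change in probability.
            contributions[i, tree.feature[parent]] += p_pos[child] - p_pos[parent]
    return bias, contributions


iris = load_iris()
mask = iris.target != 0                      # drop Iris setosa
X = iris.data[mask]
y = (iris.target[mask] == 2).astype(int)     # virginica = 1, versicolor = 0

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
bias, contrib = path_contributions(clf, X)

# The linear form reproduces the tree's own probability for every sample.
reconstructed = bias + contrib.sum(axis=1)
print(np.allclose(reconstructed, clf.predict_proba(X)[:, 1]))
```

A positive entry in `contrib` pushes that sample towards the positive class and a negative entry pushes it towards the negative class, which is the directional information that Gini and permutation importance do not provide.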


Table 1.

Sample prediction and feature contribution score.


Table 2.

[Iris dataset] model performance metrics.


Fig 3.

Average feature contribution weight per prediction in a single decision tree.

After entering the Iris dataset into our linearized decision tree, we found the most important feature for successfully determining positive or negative class to be ‘petal length’. Importantly, the outputs provided in this analysis are given a magnitude and direction for their respective involvement in classification compared to other features. These metrics offer a significant improvement on previous analyses, whilst remaining consistent with the findings of Gini and permutation importance, where ‘petal length’ had the highest and second highest weightings respectively. Note that this represents the output data from a single decision tree (fold) prior to cross-validation.


Fig 4.

Average feature contributions and count number over 5-fold cross validation.

(a) By investigating the five folds used for cross-validation shown in Table 3, we once again determined ‘petal width’ and ‘petal length’ to be the most important features for positive and negative classification within our decision tree. (b) A count of the number of folds in which each feature appeared during cross-validation revealed that all four features appeared equally often. This suggests that, despite being significantly less important for positive and negative classification than ‘petal length’ and ‘petal width’, the sepal features are used just as regularly to make predictions within the model across multiple folds.
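
The fold-level averaging and counting described here might be sketched as follows. Several choices are assumptions made for illustration rather than details taken from the paper: contributions are computed with a decision-path decomposition on each held-out fold, a feature counts as "appearing" in a fold if the fitted tree splits on it at least once, and the folds are stratified and shuffled with a fixed seed. `mean_contributions` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def mean_contributions(clf, X):
    """Hypothetical helper: average signed per-feature contribution to
    P(class 1) over the rows of X, using the tree's decision paths."""
    tree = clf.tree_
    values = tree.value[:, 0, :]
    p_pos = values[:, 1] / values.sum(axis=1)
    totals = np.zeros(X.shape[1])
    paths = clf.decision_path(X)
    for i in range(X.shape[0]):
        nodes = paths.indices[paths.indptr[i]:paths.indptr[i + 1]]
        for parent, child in zip(nodes[:-1], nodes[1:]):
            totals[tree.feature[parent]] += p_pos[child] - p_pos[parent]
    return totals / X.shape[0]


iris = load_iris()
mask = iris.target != 0                      # drop Iris setosa
X = iris.data[mask]
y = (iris.target[mask] == 2).astype(int)     # virginica = 1, versicolor = 0

fold_means = []
fold_counts = np.zeros(X.shape[1], dtype=int)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in skf.split(X, y):
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    fold_means.append(mean_contributions(clf, X[test_idx]))
    # Internal nodes store a feature index >= 0; leaves store a negative value.
    used = np.unique(clf.tree_.feature[clf.tree_.feature >= 0])
    fold_counts[used] += 1

avg_contribution = np.mean(fold_means, axis=0)
for name, avg, count in zip(iris.feature_names, avg_contribution, fold_counts):
    print(f"{name}: mean contribution={avg:+.3f}, folds used={count}")
```

Reporting the fold count alongside the averaged weight helps distinguish a feature that is consistently selected from one whose average is driven by a single fold.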


Table 3.

5-fold feature contribution average for positive and negative classification of the Iris dataset using Hollow-tree super (HOTS).


Fig 5.

Predictive brain regions for passive/apathetic social withdrawal.

Applying our methodology to a cohort of 60 subjects from the SchizConnect COBRE dataset revealed the brain regions most predictive for item N4 of the Positive and Negative Syndrome Scale (PANSS). Connectivity matrices were generated between 379 cortical and subcortical parcellations using a scheme derived from Glasser and colleagues (2015), and feature importance was computed on connectivity measures extracted from each individual parcellation. After performing cross-validation and averaging the feature importance weights, we determined R_POS2, R_IFSa, R_6mp, R_FST, R_AAIC, R_1, R_PFcm, R_PGs, and L_LIPd to be most predictive for positive classification of PANSS N4. Positive class indicates a score > 2 on item N4 of the PANSS. X-axis values are provided in log odds to more easily visualise the features of importance on a logarithmic scale. L_ = left side, R_ = right side, POS2 = area 2 of the parietal-occipital sulcus, IFSa = anterior inferior frontal sulcus, 6mp = medial posterior aspect of area 6, FST = lateral occipital visual area, AAIC = anterior agranular insular cortex, 1 = primary sensory area, PFcm = centromedian part of parietal area F, PGs = superior aspect of parietal area G, LIPd = dorsal aspect of the lateral intraparietal area.


Fig 6.

Count of recurring features across folds.

By counting feature appearances during the cross-validation process, we determined that the same features (parcellations) responsible for positive classification were also consistently used in 2 (R_PGs, R_PFcm, R_1, L_LIPd), 3 (R_POS2, R_IFSa, R_AAIC, R_6mp), or 4 (R_FST) out of the 5 folds, suggesting that these features were regularly used throughout the decision making process. Note that this plot was abbreviated to show only features with a count greater than 1.
