Inferring feature importance with uncertainties with application to large genotype data

Estimating feature importance, that is, the contribution of a feature to one or several predictions, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data-generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, Sub-SAGE has the advantage that it can be estimated without computationally expensive resampling. We argue that for all model types the uncertainties in the Sub-SAGE estimator can be estimated using bootstrapping, and we demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as on large genotype data for predicting feature importance with respect to obesity.

splitting point, and its complement group ($\tau_k^c$). Then, for any $S \in Q_k$, the difference $w_{X,Y,\hat{y}}(S \cup \{k\}) - w_{X,Y,\hat{y}}(S)$ can be simplified by using that the two random variables $V_{X,f_j}(S \cup \{k\})$ and $V_{X,f_j}(S)$ are equivalent, or equal in distribution, for $j \notin \tau_k$. Note that the corresponding observed values satisfy $v_{x,f_j}(S \cup \{k\}) = v_{x,f_j}(S)$ for all $S \in Q_k$, since the regression tree $f_j$ does not include feature $k$ and the features are assumed mutually independent.
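To make this step concrete, the following minimal Python sketch checks numerically that, for a stump $f_j$ that does not split on feature $k$ and with mutually independent features, the conditional expectations with and without $k$ in the conditioning set coincide. It assumes that $V_{X,f_j}(S)$ denotes the conditional expectation $E[f_j(X) \mid X_S]$ (consistent with the tree-stump discussion later in this section); all variable names and the chosen distributions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A regression tree stump f_j that splits only on feature 0;
# feature k = 2 is NOT used by this tree.
def f_j(x):
    return np.where(x[:, 0] > 0.0, 1.5, -0.5)

def cond_expectation(x_obs, S, n_mc=100_000):
    """Monte Carlo estimate of E[f_j(X) | X_S = x_obs[S]] under mutually
    independent standard-normal features: features outside S are resampled,
    features inside S are fixed at their observed values."""
    X = rng.normal(size=(n_mc, 3))
    X[:, S] = x_obs[S]
    return f_j(X).mean()

x_obs = np.array([0.3, -1.2, 2.1])   # one observation
S = [1]                              # subset containing neither feature 0 nor k = 2

v_S        = cond_expectation(x_obs, S)        # realization of V_{X,f_j}(S)
v_S_with_k = cond_expectation(x_obs, S + [2])  # realization of V_{X,f_j}(S ∪ {k})

# Since f_j does not use feature k and the features are independent,
# the two conditional expectations agree (up to Monte Carlo noise).
print(v_S, v_S_with_k)
```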
Using the binary cross-entropy as loss function, the loss per sample is $\ell(y_i, \hat{y}_i) = -\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$.

(Sub-)SAGE with multiple linear regression. Consider a fitted linear regression model $\hat{y}_i = \hat{\beta}^T x_i$ with uncorrelated features. By applying the squared error loss, and by considering $\hat{\beta}$ as a constant (by using data not used to estimate $\hat{\beta}$), we obtain, for a feature $k$ and a subset $S \in Q_k$, the expression in Eq. (12); see [1] for the derivation of $v_{x,\hat{y}}(S)$ in linear regression. The second term in the third line of Eq. (12) is equal to zero since the features are independent and $\hat{\beta}$ is considered a constant. Notice therefore that the Sub-SAGE value, as well as the Shapley additive global importance (SAGE) value, is independent of the subset $S$ used, and equal to the final expression in Eq. (12).
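For concreteness, the following is a minimal sketch of this computation, assuming squared error loss, mutually independent features, $\hat{\beta}$ treated as a constant, and the value function difference taken as the reduction in expected loss (a sign convention assumed here). The shorthand $\hat{y}_S = \sum_{j \in S}\hat{\beta}_j X_j + \sum_{j \notin S}\hat{\beta}_j E[X_j]$ is introduced purely for illustration and is not notation from the text:

\begin{align*}
w_{X,Y,\hat{y}}(S\cup\{k\}) - w_{X,Y,\hat{y}}(S)
  &= E\!\left[(Y-\hat{y}_S)^2\right] - E\!\left[(Y-\hat{y}_{S\cup\{k\}})^2\right] \\
  &= 2\hat{\beta}_k\,\mathrm{Cov}\!\left(X_k,\, Y-\hat{y}_{S\cup\{k\}}\right) + \hat{\beta}_k^2\,\mathrm{Var}(X_k) \\
  &= 2\hat{\beta}_k\,\mathrm{Cov}(X_k, Y) - \hat{\beta}_k^2\,\mathrm{Var}(X_k).
\end{align*}

The cross-covariances between $X_k$ and the remaining features vanish by the independence assumption, which is why the result does not depend on the subset $S$.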
The second term $\hat{\beta}_k^2 \mathrm{Var}(X_k)$ is in fact equal to the increase in the variance of the model output from including feature $k$ actively in the model, since activating feature $k$ adds the term $\hat{\beta}_k X_k$ to the prediction and, with independent features, increases its variance by exactly $\hat{\beta}_k^2 \mathrm{Var}(X_k)$. For linear regression models, this shows that the Sub-SAGE value is only positive if the agreement between the model and the independent test data (first term in Eq. (12)) outweighs the increased variance in the model (second term in Eq. (12)) from including feature $k$.
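A one-line check of this statement, under the same independence assumption and using the illustrative shorthand $\hat{y}_S$ from the sketch above:

\begin{align*}
\mathrm{Var}\!\left(\hat{y}_{S\cup\{k\}}\right) - \mathrm{Var}\!\left(\hat{y}_{S}\right)
  = \mathrm{Var}\!\left(\hat{y}_{S} + \hat{\beta}_k\,(X_k - E[X_k])\right) - \mathrm{Var}\!\left(\hat{y}_{S}\right)
  = \hat{\beta}_k^2\,\mathrm{Var}(X_k),
\end{align*}

since $X_k$ is independent of the features entering $\hat{y}_S$ and the constant shift $-\hat{\beta}_k E[X_k]$ does not affect the variance.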
We know neither the variance of $X_k$ nor the covariance between $X_k$ and $Y$, and so these must also be estimated from the data. The sample mean and sample covariance are unbiased and consistent estimators. Therefore, by using independent test data $(x^0_1, y^0_1), \ldots, (x^0_{N_I}, y^0_{N_I})$ of size $N_I$, the estimator of $\hat{\beta}_k$, denoted $T(\hat{\beta}_k)$, is statistically independent of the test data, and by applying the sample mean and covariance we obtain an unbiased estimate of Eq. (12). If we did not use training data separately for constructing the model, and test data to compute Sub-SAGE values, the second term in the third line of Eq. (12) would no longer be zero, since the estimator $T(\hat{\beta})$ is naturally correlated with the training data itself. It may seem confusing to treat $\hat{\beta}_k$ in Eq. (12) as a constant when the corresponding estimator $T(\hat{\beta}_k)$ indeed has a distribution induced by the training data. However, one may view the Sub-SAGE procedure as objectively observing the properties of the fitted model itself, without taking into account the data used for training it.
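The following minimal Python sketch illustrates this estimation step on independent test data, together with a bootstrap over test samples for the uncertainty, as advocated in the abstract. The closed form 2*beta_k*Cov(X_k, Y) - beta_k**2*Var(X_k) used below is the expression from the derivation sketch above, not a verbatim reproduction of Eq. (12); all function and variable names, and the simulated data, are illustrative.

```python
import numpy as np

def sub_sage_linear(beta_k, x_k_test, y_test):
    """Plug-in estimate of the linear-regression Sub-SAGE value for feature k,
    using sample covariance/variance from independent test data and treating
    beta_k (fitted on separate training data) as a constant."""
    cov_ky = np.cov(x_k_test, y_test, ddof=1)[0, 1]
    var_k = np.var(x_k_test, ddof=1)
    return 2.0 * beta_k * cov_ky - beta_k**2 * var_k

def bootstrap_ci(beta_k, x_k_test, y_test, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval over the test samples."""
    rng = np.random.default_rng(seed)
    n = len(y_test)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = sub_sage_linear(beta_k, x_k_test[idx], y_test[idx])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Illustrative usage on simulated data (training/test split kept separate).
rng = np.random.default_rng(1)
n_train, n_test = 5000, 5000
X = rng.normal(size=(n_train + n_test, 3))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n_train + n_test)
X_tr, y_tr, X_te, y_te = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

beta_hat = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]  # fitted on training data only
k = 0
psi_hat = sub_sage_linear(beta_hat[k], X_te[:, k], y_te)
lo, hi = bootstrap_ci(beta_hat[k], X_te[:, k], y_te)
print(f"psi_hat_{k} = {psi_hat:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```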
Sub-SAGE estimate for tree ensemble models with tree stumps. Consider a tree ensemble model with regression trees of depth one, so-called tree stumps. Each tree stump includes exactly one feature from the set $\mathcal{M}$ of all $M$ features. In accordance with earlier notation, let $\tau_k$ denote the set of tree stumps that include feature $k$. Then, Eq. (2) simplifies because all random variables $V_{X,f_j}(S)$ for $j \notin \tau_k$, for every $S$, are now independent of all $V_{X,f_j}(S)$ and $V_{X,f_j}(S \cup \{k\})$ for $j \in \tau_k$. Further, for every $j \in \tau_k$, $V_{X,f_j}(S) = E_X[f_j(X)]$, a constant equal to the expected value of the output of the regression tree $f_j$, and $E_X[V_{X,f_j}(S \cup \{k\})] = E_X[f_j(X)]$, since the regression tree $f_j$ only includes feature $k$. Therefore, the last term in Eq. (2) vanishes. Observe that, in the case of tree stumps, the resulting expression, given in Eq. (15), is independent of the subset $S$. The expression in Eq. (15) is therefore also equal to the Sub-SAGE value $\psi_k$ (or SAGE value). Both the covariance and the variance in Eq. (15) must be estimated in practice. Given independent test data $(x^0_1, y^0_1), \ldots, (x^0_{N_I}, y^0_{N_I})$, an unbiased estimate $\hat{\psi}_k$ is obtained by replacing them with their sample counterparts (see the sketch following the properties below).

Property 2 (Dummy property (null player)). Given a feature $k$ where $v(S \cup \{k\}) = v(S)$ for all $S \in Q_k$, we have $\psi_k = 0$, and so the dummy property follows by definition.
Property 3 (Linearity). Given two value functions $v(S)$ and $w(S)$, the Sub-SAGE value of the sum of the value functions $v(S) + w(S)$ is equal to the sum of the Sub-SAGE values for each value function. Hence, the linearity property follows by definition.
Property 4 (Monotonicity). Consider two models $\hat{f}_1$ and $\hat{f}_2$ used to predict the same relationship $y = f(x)$, for the same features $x$. If for any feature $k$ we have $v^{\hat{f}_1}(S \cup \{k\}) - v^{\hat{f}_1}(S) \geq v^{\hat{f}_2}(S \cup \{k\}) - v^{\hat{f}_2}(S)$ for all $S \in Q_k$, then by definition $\psi_k^{\hat{f}_1} \geq \psi_k^{\hat{f}_2}$, with $\psi_k^{\hat{f}_1}$ the Sub-SAGE value of feature $k$ when applying model $\hat{f}_1$ and $\psi_k^{\hat{f}_2}$ the corresponding Sub-SAGE value when applying model $\hat{f}_2$. This means that an adjustment of model $\hat{f}_2$ to $\hat{f}_1$ such that feature $k$'s importance increases also increases its Sub-SAGE value. Therefore, the monotonicity property follows by definition.
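Returning to the tree-stump case above, the following is a minimal Python sketch of the estimation step for a boosted stump ensemble: the stumps are grouped by their single splitting feature (the $\tau_k$ grouping), their summed contribution is evaluated on independent test data, and the sample covariance and variance are combined as $2\,\widehat{\mathrm{Cov}} - \widehat{\mathrm{Var}}$. This combination mirrors the linear-regression expression sketched earlier and is an assumption here, since Eq. (15) itself is not reproduced in this excerpt; all names and the simulated data are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)

# Simulated data with mutually independent features; separate train/test split.
n, p = 20_000, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * (X[:, 1] > 0) + rng.normal(size=n)
X_tr, y_tr, X_te, y_te = X[:n // 2], y[:n // 2], X[n // 2:], y[n // 2:]

# Tree-stump ensemble: every tree has depth one and therefore uses one feature.
model = GradientBoostingRegressor(max_depth=1, n_estimators=300,
                                  learning_rate=0.1).fit(X_tr, y_tr)

# Group the stumps by the feature they split on (tau_k in the text) and
# collect each group's contribution to the prediction on the test data.
contrib = {k: np.zeros(len(y_te)) for k in range(p)}
for stage in model.estimators_:        # each stage holds one regression stump
    stump = stage[0]
    k = int(stump.tree_.feature[0])    # root split feature of the stump
    if k < 0:                          # degenerate stump with no split
        continue
    contrib[k] += model.learning_rate * stump.predict(X_te)

# Assumed tree-stump analogue of the linear-regression expression:
# psi_k ≈ 2 * Cov(g_k(X), Y) - Var(g_k(X)), estimated from independent test data.
for k in range(p):
    g_k = contrib[k]
    psi_k = 2.0 * np.cov(g_k, y_te, ddof=1)[0, 1] - np.var(g_k, ddof=1)
    print(f"feature {k}: estimated value {psi_k:.3f}")
```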
Observation 1 (Sub-SAGE does not share the efficiency property). Consider the definition of the Shapley value, $\phi_k$, applied to a specific value function $v$ for which the contribution $v(S \cup \{k\}) - v(S)$ is the same among all $S$ such that $m \notin S$, and also the same among all $S$ such that $m \in S$. For both SAGE and Sub-SAGE, let $k_1$ and $k_2$ be the sum of the weights over all subsets $S$ where $m \notin S$ and over all subsets $S$ where $m \in S$, respectively. For SAGE, using the results from S6 Appendix, $k_1 = k_2 = 1/2$. For Sub-SAGE, one can similarly show that $k_1$ and $k_2$ depend on $M$, the total number of features in the model. In other words, the Sub-SAGE and SAGE values differ only through these constants for such a value function.

As a practical example we will consider feature 6 ($k = 6$). The first term in (27) can be derived directly from the computed SHAP value for feature 6, swapping observed values with random variables to obtain $V_{X,\hat{y}}(S)$ and $V_{X,\hat{y}}(S \cup \{k\})$ instead of $v_{x,\hat{y}}(S)$ and $v_{x,\hat{y}}(S \cup \{k\})$. Due to the assumption of feature independence, the second term in (27), involving the increased variance, only depends on features 5 and 6. As a result, the Sub-SAGE value is given by the first term in (27) minus $\left(k_1 E[X_5]^2 + k_2 E[X_5^2]\right) \mathrm{Var}\!\left(I(X_6 > 7)\right)$.
The expected values can be computed exactly by construction. For the corresponding SAGE value, change $k_1$ and $k_2$ accordingly. See Table 2 in the main paper.
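To illustrate the arithmetic of the last expression, the following sketch computes the variance-related term by simulation. The feature distributions, the first term in (27), and the Sub-SAGE constants $k_1$ and $k_2$ are not given in this excerpt, so the values used below are placeholders only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder distributions for X_5 and X_6; the actual synthetic-data
# distributions are defined in the main paper and are NOT reproduced here.
x5 = rng.normal(loc=1.0, scale=2.0, size=1_000_000)
x6 = rng.uniform(low=0.0, high=10.0, size=1_000_000)

# Placeholder weight constants; for SAGE these would be k1 = k2 = 1/2,
# while the Sub-SAGE values of k1 and k2 are not given in this excerpt.
k1, k2 = 0.5, 0.5

# Variance-related term of the worked example:
# (k1*E[X_5]^2 + k2*E[X_5^2]) * Var(I(X_6 > 7))
term = (k1 * x5.mean() ** 2 + k2 * (x5 ** 2).mean()) * np.var(x6 > 7.0)
print(term)
```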