# Understanding Bias in RF Variable Importance Metrics

Caleb Scheidel


Random forests are typically used as “black box” models for prediction, but they can return relative importance metrics associated with each feature in the model. These can be used to help interpretability and give a sense of which features are powering the predictions. Importance metrics can also assist in feature selection in high dimensional data. Careful attention should be paid to the data you are working with and when it is appropriate to use and interpret the different variable importance metrics from random forests.

## Words of caution

A recent blog post from a team at the University of San Francisco shows that the default importance strategies in both R (randomForest) and Python (scikit-learn) are unreliable in many data scenarios. In particular, mean decrease in impurity importance metrics are biased when potential predictor variables vary in their scale of measurement or in their number of categories.

It is also known that importance metrics are biased when predictor variables are highly correlated, leading to suboptimal predictor variables being artificially preferred. This has actually been known for over ten years (Strobl et al., 2007; Strobl et al., 2008), but it can be easy to assume that the default importances of popular packages will be appropriate for your particular dataset.

The papers and blog post demonstrate how continuous and high cardinality variables are preferred in mean decrease in impurity importance rankings, even when they are no more informative than variables with fewer categories. The authors suggest using permutation importance instead of the default in these cases. If the predictor variables in your model are highly correlated, conditional permutation importance is suggested.
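Both alternatives are readily available in R. As a sketch (assuming the ranger and party packages are installed; the object names here are illustrative):

```r
library(ranger)
library(party)

# Permutation importance: shuffle each predictor and measure the drop
# in out-of-bag accuracy
rf_perm <- ranger(Species ~ ., data = iris, importance = "permutation")
rf_perm$variable.importance

# Conditional permutation importance, intended for correlated predictors
cf <- cforest(Species ~ ., data = iris,
              controls = cforest_unbiased(ntree = 50))
varimp(cf, conditional = TRUE)
```

Note that conditional permutation importance is considerably slower to compute, since permutations are performed within a grid defined by the other predictors.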

## Mean decrease in impurity (Gini) importance

The mean decrease in impurity (Gini) importance metric describes the improvement in the “Gini gain” splitting criterion (for classification only), which incorporates a weighted mean of the individual trees’ improvement in the splitting criterion produced by each variable. The Gini impurity index is defined as:

$G = \sum_{i=1}^{n_c} p_i(1 - p_i) = 1 - \sum_{i=1}^{n_c} p_i^2$

where $n_c$ is the number of classes in the target variable and $p_i$ is the proportion of class $i$. In other words, it measures the disorder of a set of elements. It is calculated as the probability of mislabeling an element assuming that the element is randomly labeled according to the distribution of all the classes in the set. For regression, the analogous metric to the Gini index would be the RSS (residual sum of squares).
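As a quick sanity check on the formula, the index can be computed directly from class proportions; for a node containing all 150 iris rows (50 per species), $G = 1 - 3(1/3)^2 = 2/3$:

```r
# Gini impurity of a set of class labels
gini <- function(y) {
  p <- table(y) / length(y)  # class proportions
  1 - sum(p^2)
}

gini(iris$Species)      # 3 balanced classes: 1 - 3 * (1/3)^2 = 2/3
gini(rep("setosa", 50)) # a pure node has impurity 0
```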

To see an example of how the Gini index is calculated, let’s use the iris data set. For the purposes of this post, we’ll convert the Petal.Length and Petal.Width features to factors, rounding to the nearest integer to decrease the number of “categories” in each variable.

```r
library(tidyverse)
library(skimr)
library(knitr)

iris <- iris %>%
  as_tibble() %>%
  mutate_at(vars(starts_with("Petal")), ~ as.factor(round(., digits = 0)))

iris %>%
  skim()
```
```
── Data Summary ────────────────────────
Name                       Piped data
Number of rows             150
Number of columns          5
Column type frequency:
  factor                   3
  numeric                  2
Group variables            None

── Variable type: factor ───────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique top_counts
Petal.Length          0             1 FALSE          7 5: 35, 4: 34, 2: 26, 1: 24
Petal.Width           0             1 FALSE          3 2: 64, 0: 49, 1: 37
Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50

── Variable type: numeric ──────────────────────────────────────
skim_variable n_missing complete_rate mean   sd  p0 p25 p50 p75 p100 hist
Sepal.Length          0             1 5.84 0.83 4.3 5.1 5.8 6.4  7.9 ▆▇▇▅▂
Sepal.Width           0             1 3.06 0.44 2.0 2.8 3.0 3.3  4.4 ▁▆▇▂▁
```

There are now 3 unique categories for Petal.Width, and 7 unique categories for Petal.Length. We leave Sepal.Length and Sepal.Width as continuous variables.

Here is an example of calculating the Gini index at a couple of randomly chosen splits using the iris dataset:
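A minimal sketch of such a calculation follows; the split points on Sepal.Length are arbitrary choices for illustration, and the helper names are my own. The "Gini gain" of a split is the parent node's impurity minus the size-weighted mean impurity of the two child nodes:

```r
# Gini impurity of a set of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Gini gain for one candidate split: parent impurity minus the
# size-weighted mean impurity of the two child nodes
gini_gain <- function(split) {
  left  <- iris$Species[split]
  right <- iris$Species[!split]
  gini(iris$Species) -
    (length(left)  / nrow(iris) * gini(left) +
     length(right) / nrow(iris) * gini(right))
}

gini_gain(iris$Sepal.Length < 5.8)  # one candidate split point
gini_gain(iris$Sepal.Length < 5.0)  # another, for comparison
```

At each node, the tree-growing algorithm evaluates this gain for every candidate split and keeps the best one.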

## Why is impurity importance biased?

Each time a split is selected for a variable, every possible break point in that variable is tested to find the best one. Continuous or high cardinality variables have many more candidate split points, which creates a “multiple testing” problem: there is a higher probability that such a variable happens to predict the outcome well purely by chance, and variables where more splits are tried will appear more often in the tree.
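One quick way to see this bias (a hypothetical experiment of my own, not from the cited post; the `noise_cont` and `noise_bin` column names are made up) is to append two pure-noise predictors, one continuous and one binary, and inspect their impurity importances. Neither carries any information about Species, yet the continuous column, with its many candidate split points, tends to score higher:

```r
library(randomForest)
library(dplyr)

set.seed(42)

noisy <- iris %>%
  mutate(noise_cont = runif(n()),                      # continuous noise
         noise_bin  = factor(sample(0:1, n(), TRUE)))  # binary noise

rf_noise <- randomForest(Species ~ ., data = noisy, importance = TRUE)

# Mean decrease in impurity (type = 2): the continuous noise column
# typically ranks above the binary one, despite both being uninformative
importance(rf_noise, type = 2)
```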

We’ll continue with the iris example and use the following plotting function from the authors of the blog post at USF to create the plots showing the bias in importance rankings:

```r
library(randomForest)

create_rfplot <- function(rf, type){

  imp <- importance(rf, type = type, scale = FALSE)

  featureImportance <- data.frame(Feature = row.names(imp), Importance = imp[, 1])

  p <- ggplot(featureImportance, aes(x = reorder(Feature, Importance), y = Importance)) +
    geom_bar(stat = "identity", fill = "#53cfff", width = 0.65) +
    coord_flip() +
    theme_light(base_size = 20) +
    theme(axis.title.x = element_text(size = 15, color = "black"),
          axis.title.y = element_blank(),
          axis.text.x  = element_text(size = 15, color = "black"),
          axis.text.y  = element_text(size = 15, color = "black"))
  return(p)
}
```

The function plots the importance metrics on the x-axis and the variables on the y-axis, visualizing the relative importance rankings of each variable.

In the randomForest package, type = 2 is the default, reporting the mean decrease in impurity importance metrics. The equivalent argument in ranger is type = "impurity".

What are the mean decrease in impurity importance rankings of these features?

```r
set.seed(1)

rf1 <- randomForest(
  Species ~ .,
  data = iris,
  ntree = 40,
  nodesize = 1,
  replace = FALSE,
  importance = TRUE
)

create_rfplot(rf1, type = 2)
```