Into the Woods: Visualizing Random Forests with R
You've probably heard random forest models described as "black boxes": models that show you an input and an output and nothing in between. In this post, we go over a few techniques for exposing what a random forest model is doing, to make it less of a black box.
Partial Dependence Plots
What Partial Dependence Plots Are
In its most basic form, a partial dependence plot is a graph showing how your model's predictions depend on a single variable. How it does that exactly depends on whether you are using your random forest for classification or regression.
For a regression model, the intuitive way to understand a partial dependence plot is as a graph that shows how our predicted value, \(Y\), varies with a single free variable, \(X\), when we average out the effects of all the other variables, \(C\), in our model. How the partial dependence plot averages out the effects of the non-free variables is actually pretty cool, albeit computationally intensive. First the algorithm sets \(X\) to a fixed value, then it calculates \(Y\) at that fixed \(X\) for every occurrence of \(C\) that appears in the training data. Those \(Y\) values are then averaged, and the process is repeated for every \(X\) value that appears in the training data. That is,
$$ \overline{Y}(X) = \frac{1}{N} \sum_{i=1}^N Y(X, C_i) $$ where \(C_1, C_2, \ldots, C_N\) are the occurrences of \(C\) that appear in the training data.
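To make the averaging concrete, here is a minimal hand-rolled sketch of that loop. It assumes a fitted regression forest rf, a training data frame train, and a free variable named "lstat"; all three names are placeholders, not part of any library API.

library(randomForest)

# Hand-rolled partial dependence: pin X at each training value,
# then average the predictions over every training occurrence of C.
pdp_manual <- function(model, data, x.var) {
  x.grid <- sort(unique(data[[x.var]]))    # every X value in the training data
  y.bar <- sapply(x.grid, function(x) {
    data.fixed <- data
    data.fixed[[x.var]] <- x               # fix X; the other columns (C) stay as observed
    mean(predict(model, data.fixed))       # average Y over the C_i
  })
  data.frame(x = x.grid, y = y.bar)
}

pd <- pdp_manual(rf, train, "lstat")       # "lstat" is a placeholder variable name
plot(pd$x, pd$y, type = "l", xlab = "lstat", ylab = "partial dependence")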
Understanding how partial dependence plots work for classification models requires a little more understanding of how random forests work. For a quick recap, a random forest operates by growing many tree models and having each tree vote on the outcome, with the majority vote deciding the prediction. If you want a more thorough explanation of how random forests are built, I suggest you read this. A partial dependence plot in this case lets us see how that vote shifts for a single variable.
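You don't have to roll the averaging yourself: the randomForest package ships a partialPlot function that does the same thing. A quick sketch, where rf, rf_class, train, "lstat", and "yes" are all placeholder names from hypothetical models and data:

# regression: plots the averaged prediction against one variable
partialPlot(rf, pred.data = train, x.var = "lstat")

# classification: plotted on the logit scale for the chosen class
partialPlot(rf_class, pred.data = train, x.var = "lstat", which.class = "yes")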
Extracting Rules From Random Forests
At the end of the day, a random forest model isn't actually that different from a single regression tree. Sure, the composition is more complex, but a random forest is still tracing a decision path from the root to a final leaf, with each fork defined by a threshold value on an individual feature, and each fork changing the parent node's initial guess (usually the trainset mean) by a specified amount. This path can be described by the equation below:
$$ Outcome = InitialGuess + featureContribution_1 + featureContribution_2 + \ldots + featureContribution_N $$
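As a purely illustrative example (the numbers here are invented), a path through a house-price tree might start at a trainset mean of 20, gain 3 at a rooms > 6 split, and lose 2 at a crime > 0.5 split:

$$ 21 = \underbrace{20}_{InitialGuess} + \underbrace{3}_{rooms > 6} + \underbrace{(-2)}_{crime > 0.5} $$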
It's just that normally this tree path is hidden from us, and the random forest model only gives us the output for a given input (hence the perception of random forests as black boxes). However, we can expose the rules of our random forest model through the "inTrees" library in R.
The snippet of code below shows how we can extract the rules from a random forest model.
library(randomForest)
library(inTrees)

rf <- randomForest(dat_rnd, target)                 # fit the forest (x = predictors, y = target)
treeList <- RF2List(rf)                             # transform the rf object into inTrees' format
exec <- extractRules(treeList, dat_rnd)             # extract R-executable conditions
ruleMetric <- getRuleMetric(exec, dat_rnd, target)  # measure each rule's length, frequency, and error
In practice, the list of rules we extract from a given RF model will be too large for practical use without further treatment (the number of rules can easily be in the thousands). But the inTrees library comes with built-in tools that make it easy to prune the rule set down to a practical list of if-then statements. Going back to the rules we extracted earlier, we can use the following bit of code to prune the list down to the best rules and present them as a table. Each row of the table gives us a rule, the number of decisions in the rule, the rule's frequency in the forest, the rule's prediction error, and the class the rule predicts.
ruleMetric <- pruneRule(ruleMetric, dat_rnd, target)         # drop redundant conditions within each rule
ruleMetric <- selectRuleRRF(ruleMetric, dat_rnd, target)     # keep a compact, complementary subset of rules
learner <- buildLearner(ruleMetric, dat_rnd, target)         # assemble the pruned rules into a simplified learner
Simp_Learner <- presentRules(ruleMetric, colnames(dat_rnd))  # readable if-then table with real column names
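To sanity-check how much predictive power the simplified rule list retains, inTrees can also apply the learner straight to data. A short sketch, reusing the same dat_rnd and target objects as above:

pred <- applyLearner(learner, dat_rnd)  # predict with the simplified rule learner
mean(pred == target)                    # agreement with the true labels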
Variable Importance
This is probably the most well-known way to understand random forests, but I figured I should include it here for the sake of thoroughness. As the name implies, the variable importance function sorts the variables in a random forest model by order of importance, where importance is defined either by a variable's contribution to the mean squared error or by its effect on node impurity. In order to use this function, be sure to specify importance = TRUE when building the random forest model, as demonstrated in the code below.
RF <- randomForest(formula2, data = train, importance = TRUE, ntree = 32)
# sort variables by permutation importance (%IncMSE is column 1 for regression)
imp <- as.data.frame(sort(importance(RF)[, 1], decreasing = TRUE), optional = TRUE)
names(imp) <- "% Inc MSE"
imp
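If you prefer a picture over a table, the randomForest package also ships a one-line plot of the same numbers:

varImpPlot(RF)  # dot charts of %IncMSE and IncNodePurity side by side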
Further Reading
The explanations in this blog post are mostly summaries of the following articles:
Partial Dependence Plots: Partial Dependence Plots
Rules Extraction: One Tree To Rule Them All
Rules Extraction: Interpreting Random Forests