
Minimize the Number of Trees in a Random Forest Model
I frequently use random forests and have consistently observed that we tend to create far more decision trees than necessary.

Of course, this can be adjusted as a hyperparameter, but finding the optimal number of trees typically involves training multiple random forest models, which can be time-consuming.
Today, I’ll share an incredible trick I recently developed to:
- Boost the accuracy of a random forest model
- Reduce its overall size
- Dramatically improve its prediction speed
And the best part? You can achieve all this without retraining the model.
Let’s dive in!
The Logic
As we know, a random forest model is an ensemble of numerous individual decision trees:

The final prediction in a random forest is generated by aggregating the predictions from each individual, independent decision tree.
Since each decision tree in a random forest operates independently, it follows that each tree will have its own validation accuracy, correct?

This approach implies that some decision trees will perform better than others. So, what if we try the following?
- Calculate the validation accuracy for each individual decision tree.
- Sort these accuracies in decreasing order.
- Keep only the top “k” decision trees with the highest validation accuracy and remove the rest.
By doing this, we retain only the best-performing trees in the random forest, as measured on the validation set.
Pretty cool, right?
Deciding the Optimal “k”
To find the best “k,” we can create a cumulative accuracy plot:
- Plot the accuracy of the random forest by progressively including more trees:
  - Start with the first two top-performing decision trees.
  - Add the third, then the fourth, and so on.
Typically, accuracy will initially improve as more decision trees are included, then level off or even decrease.
By observing this plot, we can identify the optimal "k" value.
Implementation
Let’s go through the implementation steps.
First, we train our random forest model as we normally would:

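As a minimal sketch, assuming a scikit-learn RandomForestClassifier and an existing train/validation split (X_train, y_train, X_val, y_val are placeholder names, not from the original post):

```python
from sklearn.ensemble import RandomForestClassifier

# Assumes X_train, y_train, X_val, y_val already exist from a train/validation split.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(model.score(X_val, y_val))  # validation accuracy of the full forest
```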
Next, we need to calculate the accuracy of each individual decision tree in the random forest.
In scikit-learn, each tree within a random forest model can be accessed through the model.estimators_ attribute.

To do this, we can iterate over each tree in model.estimators_ and compute its validation accuracy individually. We'll store the results in model_accs, a NumPy array that holds each tree's ID along with its corresponding validation accuracy.

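Here's a minimal sketch of how you might create and populate this array, assuming accuracy_score from scikit-learn and the (X_val, y_val) split defined earlier:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# One row per tree: column 0 is the tree's index, column 1 its validation accuracy.
model_accs = np.zeros((len(model.estimators_), 2))

for idx, tree in enumerate(model.estimators_):
    acc = accuracy_score(y_val, tree.predict(X_val))
    model_accs[idx] = [idx, acc]
```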
Now, we need to rearrange the decision tree models in the model.estimators_ list according to their validation accuracies, in decreasing order. We can do this by first sorting the accuracies to get the tree indices in order of performance, and then reordering the decision trees accordingly, so the best-performing trees sit at the beginning of the list.

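One way to obtain this ordering is to sort the rows of model_accs by their accuracy column and keep just the tree indices; a minimal sketch (the resulting model_ids array is what's referenced below):

```python
# Sort the (tree ID, accuracy) rows by accuracy in decreasing order
# and keep only the tree IDs.
model_ids = model_accs[model_accs[:, 1].argsort()[::-1]][:, 0].astype(int)

print(model_ids[:5])  # indices of the top-performing trees
```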
This list tells us that the model at index 65 is the highest performing, followed by the one at index 97, and so on.
Next, let's reorder the tree models in the model.estimators_ list according to the order specified in model_ids. This puts the decision trees in the desired sequence, with the best performers first, making it easier to work with the selected models going forward.

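A minimal sketch of this step (scikit-learn keeps the fitted trees in the estimators_ list, so reassigning it works here):

```python
# Rearrange the fitted trees so the best-performing ones come first.
model.estimators_ = [model.estimators_[i] for i in model_ids]
```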
Great! Now that we have rearranged the tree models, we can create the cumulative accuracy plot we discussed earlier. This line plot will illustrate the accuracy of the random forest by progressively including:
- Only the first two decision trees
- Only the first three decision trees
- Only the first four decision trees
- And so on
This visualization will help us observe how the accuracy changes as we add more trees, allowing us to determine the optimal number of decision trees to retain.

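Here's a sketch of what that loop might look like, assuming deepcopy for copying the fitted model and matplotlib for the plot:

```python
from copy import deepcopy
import matplotlib.pyplot as plt

small_model = deepcopy(model)  # copy of the original forest
ks = range(2, len(model.estimators_) + 1)
cumulative_accs = []

for k in ks:
    # Keep only the first k (best-performing) trees and evaluate.
    small_model.estimators_ = model.estimators_[:k]
    small_model.n_estimators = k  # keep the metadata consistent
    cumulative_accs.append(small_model.score(X_val, y_val))

plt.plot(list(ks), cumulative_accs)
plt.xlabel("Number of trees (k)")
plt.ylabel("Validation accuracy")
plt.show()
```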
In the code above:
- We create a copy of the original model, naming it small_model.
- In each iteration, we update small_model to include only the first "k" trees from the base model.
- Finally, we evaluate small_model using just those "k" trees.
When we plot the cumulative accuracy results, we will obtain a visualization that shows how the model's accuracy evolves as we include more trees. This plot will provide valuable insights into the effectiveness of using a smaller subset of decision trees.

It's evident that the maximum validation accuracy is achieved by using only 10 trees, resulting in a ten-fold reduction in the total number of trees.
When we compare the accuracy and run-time of the models, we find:

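As a rough sketch of how this comparison could be made, with k = 10 taken from the plot above and a simple timer around each model's predictions (the timing approach here is an assumption, not the original benchmark):

```python
import time

# Keep only the top-k trees identified from the cumulative accuracy plot.
best_k = 10
small_model.estimators_ = model.estimators_[:best_k]
small_model.n_estimators = best_k

for name, m in [("full forest", model), ("pruned forest", small_model)]:
    start = time.perf_counter()
    acc = m.score(X_val, y_val)
    elapsed = time.perf_counter() - start
    print(f"{name}: accuracy = {acc:.3f}, prediction time = {elapsed:.4f}s")
```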
We achieved a 6.5% increase in accuracy and a 13-fold improvement in prediction run-time.
Now, let me ask you:
Did we do any retraining or hyperparameter tuning? No.
By reducing the number of decision trees, didn't we improve the prediction run-time? Absolutely.
Isn't that impressive?
However, it’s important to note that overly reducing the ensemble size is not advisable, as we want to ensure our random forest maintains a diverse range of decision trees.
Additionally, be cautious about overfitting the validation set by selecting only the trees that perform best on it.
Choosing the optimal number of trees, or "k," is often a subjective decision and should not rely solely on validation accuracy; other factors should be considered as well.
What are your thoughts on this?
If you’re curious about the mathematical details and want to delve deeper into Bagging from a mathematical perspective, check out our article: Why is Bagging So Effective at Reducing Variance?
It covers all the essential details, including:
- Why Bagging is so effective
- Why we sample rows from the training dataset with replacement
In conclusion, pruning the number of decision trees in a random forest can deliver significant improvements in both accuracy and run-time without any retraining or hyperparameter tuning: by keeping only the top-performing trees, as measured on the validation set, we strike a balance between model performance and efficiency.
That said, it's crucial to maintain a diverse set of trees and to avoid overfitting the validation set, so the choice of "k" should weigh factors beyond validation accuracy alone.
This exploration highlights the effectiveness of ensemble methods like random forests and the potential for further enhancing them through simple, thoughtful modifications. For the underlying mathematics of Bagging and its role in variance reduction, the article linked above covers the details.