Badreddine Chaguer

Senior Data Scientist / Co-founder

Random Forest vs. Extra Trees


Under default settings, decision trees tend to overfit. This occurs because, in implementations like those in scikit-learn, a decision tree is permitted to expand until every leaf node is perfectly pure.
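As a quick illustration, here is a minimal sketch (the synthetic dataset and split are my own arbitrary choices) showing a default decision tree reaching perfect training accuracy while scoring noticeably lower on held-out data:

```python
# Minimal sketch: a default decision tree memorizes the training set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42)  # default: grows until leaves are pure
tree.fit(X_train, y_train)

print("Train accuracy:", tree.score(X_train, y_train))  # 1.0 on the training set
print("Test accuracy:", tree.score(X_test, y_test))     # typically noticeably lower
```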

When a tree perfectly classifies every training instance, it reaches 100% training accuracy but generalizes poorly to unseen data. Random Forests mitigate this issue by incorporating randomness in two key ways (illustrated in the snippet after this list):

  1. Creating bootstrapped datasets.
  2. Randomly selecting candidate features for node splitting.
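As a rough sketch (assuming scikit-learn; the values shown simply match recent defaults), these two sources of randomness correspond to two RandomForestClassifier parameters:

```python
# Minimal sketch: the two sources of randomness as RandomForestClassifier parameters.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,       # 1. each tree is fit on a bootstrapped sample
    max_features="sqrt",  # 2. each split considers a random subset of features
    random_state=42,
)
```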

These methods support the objectives of Bagging, which we explored in detail in our article: Why is Bagging So Effective at Reducing Variance?

However, there is another algorithm that adds even more randomness to the random forest framework: Extra Trees.

Note: "Extra Trees" doesn’t imply an increased number of trees; it stands for "Extremely Randomized Trees."

Extra Trees function similarly to Random Forests but introduce an additional layer of randomness:

  1. A bootstrapped dataset is created for each tree, just like in Random Forests.
  2. Candidate features are selected randomly for node splits, as in Random Forests.

The key difference is that while Random Forests calculate the optimal split threshold for each candidate feature, Extra Trees randomly select this split threshold as well.

This additional randomness is where Extra Trees gain their advantage: a random threshold is drawn for each candidate feature, and the best of these random splits is then chosen, which further reduces the model's variance.
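To make the difference concrete, here is a minimal sketch of how a threshold for a single numeric feature might be picked under each scheme. This is my own illustration using Gini impurity, not scikit-learn's actual internals:

```python
import numpy as np

def gini_split(x, y, t):
    """Weighted Gini impurity of the two children produced by the split x <= t."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels) / len(labels)
        return 1.0 - np.sum(p ** 2)
    left, right = y[x <= t], y[x > t]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(x, y):
    """Random Forest style: scan candidate thresholds and keep the one
    with the lowest weighted Gini impurity."""
    x_sorted = np.sort(x)
    candidates = (x_sorted[:-1] + x_sorted[1:]) / 2  # midpoints between sorted values
    return min(candidates, key=lambda t: gini_split(x, y, t))

def random_threshold(x, rng):
    """Extra Trees style: draw the threshold uniformly between the feature's
    min and max, with no impurity scan at this step."""
    return rng.uniform(x.min(), x.max())

rng = np.random.default_rng(0)
x = rng.normal(size=100)           # one numeric feature
y = (x > 0.3).astype(int)          # toy binary labels
print(best_threshold(x, y))        # close to 0.3
print(random_threshold(x, rng))    # anywhere in [min(x), max(x)]
```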

Below, I've compared three models (Decision Tree, Random Forest, and Extra Trees) on a dummy dataset; a code sketch to reproduce the comparison follows the summary:

Decision Trees tend to overfit completely.

Random Forests show improved performance.

Extra Trees perform slightly better than both.
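If you want to reproduce a comparison along these lines, a minimal sketch using cross-validation on a synthetic dataset might look like the following (the dataset parameters are my own arbitrary choices, and exact scores will vary):

```python
# Minimal sketch: compare the three models with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    # bootstrap is False by default for Extra Trees; see the note below.
    "Extra Trees": ExtraTreesClassifier(bootstrap=True, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```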

⚠️ Important Note: When using Extra Trees in scikit-learn, be aware that the bootstrap flag is set to False by default.

Ensure you set bootstrap=True; otherwise, the algorithm will utilize the entire dataset for each tree.
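For example (assuming a recent scikit-learn version, where estimators expose their constructor arguments as attributes):

```python
from sklearn.ensemble import ExtraTreesClassifier

print(ExtraTreesClassifier().bootstrap)                # False by default
print(ExtraTreesClassifier(bootstrap=True).bootstrap)  # True once enabled
```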

In conclusion, both Random Forests and Extra Trees are powerful ensemble learning methods that effectively address the overfitting challenges commonly associated with decision trees. While Random Forests introduce randomness through bootstrapping and feature selection, Extra Trees take it a step further by randomly selecting split thresholds for candidate features, resulting in even lower variance and improved performance.

When implementing Extra Trees, it’s crucial to remember to set bootstrap=True to ensure that the model benefits from bootstrapped datasets, enhancing its ability to generalize beyond the training data. Overall, both algorithms offer valuable tools for improving predictive accuracy and robustness in various machine learning applications, allowing data scientists to choose the best approach based on their specific needs and data characteristics.
