
Why is Bagging So Effective at Reducing Variance?
The random forest is a highly powerful and robust model, created by combining multiple decision trees.
What gives it an edge over a traditional decision tree model is its use of Bagging.

Anyone familiar with Random Forests has likely come across Bagging and how it functions.
In my experience, there are many resources that clearly explain:
- The algorithmic process behind Bagging in random forests
- Experimental demonstrations of how Bagging reduces overall variance (or mitigates overfitting)
However, these resources often fall short in explaining the intuition behind:
- Why Bagging is so effective
- The reason for sampling rows from the training dataset with replacement
In this article, I’ll tackle each of these questions to provide you with a clear, intuitive understanding of:
- Why Bagging makes the Random Forest algorithm so effective at reducing variance
- Why Bagging uses sampling with replacement
Let’s begin!
The Overfitting Experiment
Decision trees are widely appreciated for their interpretability and simplicity.
However, less well-known is their tendency to overfit the data they’re trained on. This overfitting issue arises because a standard decision tree algorithm makes greedy choices, selecting the best possible split at each node. This process increasingly purifies the nodes as you move further down the tree.
If we don’t set limits on the tree's growth, a decision tree will, in most cases, end up perfectly fitting the training dataset, essentially overfitting it completely.
For example, let’s say we have some dummy data, and we want to deliberately overfit it completely using a model, like a linear regression model, just for demonstration purposes.

This task will require considerable effort from the engineer.
In other words, we can’t simply use linear_model.fit(X, y) here to directly overfit the dataset.
from sklearn.linear_model import LinearRegression

# Fitting a plain linear model on the raw features will not overfit this data
linear_model = LinearRegression()
linear_model.fit(X, y)

Instead, as noted earlier, fully overfitting this dataset will require extensive feature engineering.
For example, to deliberately overfit this dummy dataset, we would need to engineer specific features, primarily higher-degree polynomial features in this case.
Here’s how this would look:

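For instance, a minimal sketch of this feature engineering might look as follows, assuming X is an array of shape (n_samples, 1) holding the dummy feature and y holds the target (degree 40 matches the plot discussed below):

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Expand the single feature x into polynomial features up to degree 40
poly = PolynomialFeatures(degree=40)
X_poly = poly.fit_transform(X)

# A plain linear model on these engineered features can now overfit the data
overfit_model = LinearRegression().fit(X_poly, y)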
As demonstrated above, as we increase the degree of the feature x in polynomial regression, the model gradually overfits the dataset. By the time the polynomial degree reaches 40, the model fully overfits the data.
The key takeaway is that overfitting this dataset (or any dataset, for that matter) with linear regression usually requires some level of feature engineering. While the example dataset was easy to overfit, a complex dataset with a variety of feature types would likely demand significant engineering effort to achieve the same result.
With a decision tree model, however, this is not the case at all. Overfitting any dataset with a decision tree requires no additional effort from the engineer.
In other words, we can simply run dtree_model.fit(X, y) to overfit any dataset, whether for regression or classification.
This is because a standard decision tree continues to expand by adding new levels until all leaf nodes are pure. As a result, by default, it fully overfits the dataset, as shown below:
from sklearn.tree import DecisionTreeRegressor

# With default settings, the tree keeps splitting until every leaf is pure,
# so this alone fully overfits the training data
tree_model = DecisionTreeRegressor()
tree_model.fit(X, y)

The same issue occurs with classification datasets as well.
For example, let's take a look at the following dummy binary classification dataset.

It's evident that there is significant overlap between the two classes.
However, a decision tree remains unconcerned about this overlap.
The model will still meticulously establish its decision boundary to classify the dataset with 100% accuracy.
This is illustrated below:

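We can verify this behavior with a quick sketch. Since the exact dummy dataset isn't reproduced here, the example below generates its own overlapping binary classes with make_classification (the parameters are illustrative):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Two overlapping classes (a low class_sep forces significant overlap)
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.5, random_state=0)

# An unconstrained tree keeps splitting until every leaf is pure
clf = DecisionTreeClassifier().fit(X, y)
print(clf.score(X, y))  # training accuracy: 1.0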
It’s essential to tackle this issue.
Remedies to Prevent Overfitting
There are several strategies to mitigate overfitting, including techniques like pruning and ensembling.
Pruning
Pruning is a widely used technique in tree-based models that involves removing branches (or nodes) to simplify the model.

For example, we can intentionally limit the growth of the decision tree by setting a maximum depth. In the implementation provided by scikit-learn, this can be achieved by specifying the max_depth parameter.

Pruning can also be achieved by specifying the minimum number of samples required to split an internal node (the min_samples_split parameter in scikit-learn).
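For example, a minimal sketch combining both options (the parameter values here are illustrative, not tuned, and X and y are assumed to be the training data):

from sklearn.tree import DecisionTreeClassifier

# Cap the depth and require at least 10 samples before a node can be split
pruned_tree = DecisionTreeClassifier(max_depth=4, min_samples_split=10)
pruned_tree.fit(X, y)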
Another technique for pruning is called cost-complexity pruning (CCP).
CCP takes into account two key factors for pruning a decision tree:
- Cost (C): The number of misclassifications
- Complexity (C): The number of nodes
It’s important to note that removing nodes will typically lead to a decrease in the model's accuracy.
Therefore, with decision trees, the core idea is to iteratively remove sub-trees, ensuring that each removal results in:
- A minimal increase in classification cost
- A maximum reduction in complexity (or number of nodes)
This process is illustrated below:

In the image above, both sub-trees lead to the same increase in cost. However, it's more logical to remove the sub-tree with the greater number of nodes to decrease computational complexity.
In scikit-learn, we can manage cost-complexity pruning using the ccp_alpha parameter:
- A large value of ccp_alpha → results in underfitting
- A small value of ccp_alpha → results in overfitting
The goal is to find the optimal value of ccp_alpha that produces a better model.
The effectiveness of cost-complexity pruning is clearly illustrated in the image below:

Training the decision tree without any cost-complexity pruning produces a highly complex decision region and a model that achieves 100% accuracy on the training data. However, by adjusting the ccp_alpha parameter, we can prevent overfitting while simultaneously improving accuracy on the test set.
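In practice, scikit-learn can also enumerate the candidate ccp_alpha values with cost_complexity_pruning_path, so we can simply pick the value that performs best on held-out data. A minimal sketch, assuming X_train, y_train, X_test, and y_test are already available:

from sklearn.tree import DecisionTreeClassifier

# Candidate alphas produced by minimal cost-complexity pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Train one tree per alpha and keep the one that generalizes best
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score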
Ensemble Learning
Another effective technique for preventing overfitting is ensemble learning.
In essence, ensemble learning combines multiple models to create a more robust overall model.
Whenever I want to intuitively demonstrate the significant power of ensemble methods, I refer to the following image:

Ensemble methods are fundamentally based on the concept that by aggregating the predictions of multiple models, the weaknesses of individual models can be reduced. This combination is expected to yield better overall performance.
Ensembles are primarily constructed using two different strategies:
- Bagging
- Boosting
1) Bagging
Bagging, or Bootstrap Aggregating, involves training multiple models independently on different subsets of the training data and then averaging their predictions (for regression) or taking a majority vote (for classification). This approach helps reduce variance and improve model stability.

Here’s how Bagging works:
- Data Subsets: Bagging creates different subsets of the training data through bootstrapping, which means sampling with replacement.
- Model Training: Each subset is used to train a separate model.
- Prediction Aggregation: Finally, the predictions from all models are aggregated to produce the final prediction (a minimal sketch of these steps follows below).
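To make these steps concrete, here is a minimal from-scratch sketch of Bagging for regression, assuming X and y are NumPy arrays holding a regression dataset (the number of trees is arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

n_trees, n_samples = 100, len(X)
trees = []

for _ in range(n_trees):
    # 1) Bootstrap: sample row indices with replacement
    idx = np.random.choice(n_samples, size=n_samples, replace=True)
    # 2) Train a separate model on each bootstrap sample
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# 3) Aggregate: average the individual predictions
y_pred = np.mean([tree.predict(X) for tree in trees], axis=0)

This is essentially what sklearn.ensemble.BaggingRegressor automates, and Random Forests add per-split feature sampling on top of it.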
Some common models that utilize Bagging include:
- Random Forests
- Extra Trees
2) Boosting
Boosting, on the other hand, involves training models sequentially. Each new model focuses on the errors made by the previous models, aiming to correct them. This process continues until a specified number of models have been trained or no further improvement can be made. Boosting tends to improve model accuracy primarily by reducing bias.

Here’s how Boosting works:
- Iterative Training: Boosting is an iterative training process.
- Focus on Misclassifications: Each subsequent model places greater emphasis on the misclassified samples from the previous model.
- Weighted Predictions: The final prediction is a weighted combination of all the individual predictions (see the example below).
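As a concrete example, here is a minimal sketch using scikit-learn's AdaBoostClassifier (the number of estimators is illustrative, and X_train, y_train, X_test, y_test are assumed to exist):

from sklearn.ensemble import AdaBoostClassifier

# Each new weak learner puts more weight on the samples the previous ones
# misclassified; the final prediction is a weighted vote over all of them
boosted = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)
print(boosted.score(X_test, y_test))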
Some common models that utilize Boosting include:
- XGBoost
- AdaBoost
Overall, ensemble models significantly enhance predictive performance compared to using a single model. They are generally more robust, better at generalizing to unseen data, and less susceptible to overfitting.
As mentioned earlier, this article focuses specifically on Bagging. To recap, plenty of resources clearly explain how Bagging works algorithmically in random forests and demonstrate experimentally that it reduces overall variance (or overfitting).
For instance, we can experimentally verify the reduction in variance ourselves.
The following diagram illustrates the decision region plots obtained from both a decision tree and a random forest model:

It’s evident that a random forest exhibits significantly lower variance (overfitting) compared to the decision tree model.
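This comparison is easy to reproduce. A minimal sketch, assuming a train/test split of the classification data from earlier (the exact scores will vary):

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier().fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# The single tree typically scores 1.0 on the training data but noticeably
# lower on the test data; the forest usually narrows that gap
print(tree.score(X_train, y_train), tree.score(X_test, y_test))
print(forest.score(X_train, y_train), forest.score(X_test, y_test))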
Typically, resources explain the concept of Bagging like this:
Instead of training a single decision tree, multiple trees are trained, each on different subsets of the dataset generated through sampling with replacement. Once all the trees are trained, their predictions are averaged to obtain the final output. This approach reduces overall variance and enhances the model’s generalization capabilities.
However, as noted at the outset, these resources often struggle to convey the intuition behind why Bagging is so effective at reducing variance and why it samples rows from the training dataset with replacement. The rest of this article addresses exactly those two questions.
Towards the end, we will also develop an understanding of the Extra Trees algorithm and how it further contributes to variance reduction.
Once we grasp the objective that Bagging aims to achieve, we will formulate new strategies to create our own Bagging algorithms.
Let’s get started!
Motivation for Bagging
As illustrated in an earlier diagram, the fundamental concept behind a random forest model is to train multiple decision tree models, each on a different sample of the training dataset.

During inference, we average the predictions from all the decision trees to obtain the final prediction:

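For regression, this averaging can be made explicit: the prediction of scikit-learn's RandomForestRegressor is the mean of its individual trees' predictions. A minimal sketch, assuming X and y are NumPy arrays holding a regression dataset:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100).fit(X, y)

# Average the predictions of the individual trees ...
manual_average = np.mean([tree.predict(X) for tree in forest.estimators_], axis=0)

# ... which matches what forest.predict() returns
assert np.allclose(manual_average, forest.predict(X))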
As we noted earlier, training multiple decision trees helps reduce the overall variance of the model.

Conclusion
In summary, Bagging is a powerful technique that significantly enhances the performance of machine learning models, particularly decision trees. By training multiple decision trees on different subsets of the training data and averaging their predictions, we can effectively reduce overfitting and improve the model's generalization to unseen data.
This article has explored the core principles behind Bagging, the rationale for sampling with replacement, and the mathematical foundations that verify its effectiveness in variance reduction. Additionally, we discussed the importance of ensemble methods, including Extra Trees, and how they further contribute to improving predictive performance.
Ultimately, understanding Bagging not only equips us with a robust strategy for building more resilient models but also inspires us to develop innovative approaches to ensemble learning. As we continue to explore these methods, we can better tackle complex datasets and enhance our machine learning capabilities.