In 1906, Sir Francis Galton, a polymath and cousin of Charles Darwin, set out to demonstrate the folly of collective judgment through an intriguing experiment. At an exposition, attendees were invited to guess the weight of an ox, with each participant submitting their prediction into a box. The prize? The individual with the closest estimate would take the animal home.
Galton's hypothesis was grounded in the belief that only those with specific training and experience would make accurate guesses, while the majority—untrained and inexperienced—would fail miserably. He anticipated that even if the predictions of the general public were averaged, the result would be significantly off from the actual weight and far from the guess of the most knowledgeable participant.
Eight hundred people participated in the experiment. True to his scientific rigor, Galton meticulously analyzed the statistical data. To his astonishment, he found that the average of all the guesses, which was 1,197 pounds, was remarkably close to the actual weight of the ox, which was 1,198 pounds. This collective estimate was not only surprisingly accurate but also surpassed the predictions made by individual experts in the field, such as butchers, cattle veterinarians, and farmers.
This unexpected result challenged Galton's initial hypothesis and suggested a fascinating insight: the aggregated judgment of a diverse group could be more accurate than that of even the most knowledgeable experts.
This story is recounted in James Surowiecki's book The Wisdom of Crowds[1] and serves as an illustrative example of the concept behind one of the most powerful families of machine learning models today: Ensembles.
Ensemble methods in machine learning rely on the principle that combining the predictions of multiple models can often lead to better performance than relying on a single model. Much like how the diverse guesses of a crowd can average out to an accurate estimate, ensemble techniques aggregate the strengths of various models to enhance predictive accuracy and robustness. This approach underscores the importance of diversity and collective wisdom in achieving reliable results in complex tasks.
Model Ensembles
The concept of model ensembles involves training multiple models, each on different subsets or versions of the data. This technique leverages the idea that each model may overfit in distinct ways, capturing different aspects of the dataset's nuances and potential noise. Typically, individual models in an ensemble tend to have low bias but high variance, as seen in models like deep decision trees, which can be highly sensitive to the specific data they are trained on.
To classify a new instance, ensemble methods employ a voting mechanism: each model in the ensemble provides a classification, and the class chosen most frequently across all models is returned as the final prediction. This collective decision-making reduces the variance of the final classification, because the individual models' errors and overfitting tendencies tend to average out, leading to more stable and reliable predictions.
Additionally, if the individual models return probabilities rather than binary decisions, a weighted voting scheme can be employed. In this approach, models with higher confidence in their predictions can have a greater influence on the final decision, further refining the ensemble's output. This flexibility allows ensembles to harness the strengths of diverse models while mitigating their individual weaknesses.
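To make the two voting schemes concrete, here is a minimal sketch using scikit-learn's VotingClassifier. The synthetic dataset, the choice of base models, and the voting weights are illustrative assumptions, not a prescription.

```python
# A minimal sketch of hard vs. soft (weighted) voting with scikit-learn.
# The dataset, base models, and weights are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base_models = [
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
]

# Hard voting: each model casts one vote for a class, and the majority wins.
hard_ensemble = VotingClassifier(estimators=base_models, voting="hard")

# Soft voting: predicted probabilities are averaged (here with weights), so
# more confident or more trusted models have a larger influence on the result.
soft_ensemble = VotingClassifier(estimators=base_models, voting="soft",
                                 weights=[1, 2, 1])

for name, model in [("hard voting", hard_ensemble), ("soft voting", soft_ensemble)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```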
Bagging
Bagging, short for Bootstrap Aggregating, is a technique used to improve the stability and accuracy of machine learning algorithms. It involves creating multiple new training datasets by sampling from the original dataset with replacement, a method known as bootstrap sampling. Each of these bootstrapped datasets is then used to train a separate model.
The key idea behind bagging is that by training models on slightly different datasets, each model will learn different aspects of the data, thus reducing the overall variance. After training, the predictions from these models are combined, typically by averaging for regression tasks or by majority voting for classification tasks. This ensemble approach helps to smooth out the noise and variability in individual model predictions, leading to a more accurate and robust final model.
Bagging is particularly effective for algorithms with high variance, such as decision trees, as it mitigates the risk of overfitting while maintaining the low bias inherent to the base models.
Step-by-Step Process
Dataset Splitting: The process begins by generating multiple training subsets from the original dataset through bootstrap sampling: each subset is created by randomly drawing samples from the original dataset with replacement, so the same data point can appear several times within a subset while other points are left out entirely. Each bootstrapped subset is typically the same size as the original dataset, which ensures that every model in the ensemble is trained on a comparable amount of data while still seeing a slightly different random view of it, allowing the models to learn different patterns or aspects of the dataset.
Model Training: For each of these bootstrapped subsets, a separate model is trained independently. These models may all be of the same type, such as decision trees, or they may vary depending on the specific implementation of the bagging technique.
Aggregation of Models: After training, the predictions from each individual model are aggregated to form a single predictive model. This aggregation is usually done by averaging the outputs for regression tasks or by taking a majority vote for classification tasks. The resulting model leverages the diverse insights from each of the individual models, leading to a final prediction that is generally more accurate and less sensitive to overfitting than any single model trained on the full dataset.
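The following is a from-scratch sketch of these three steps, assuming decision trees as base models and a synthetic classification dataset; both choices are illustrative.

```python
# A from-scratch sketch of the three bagging steps described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_models = 25
rng = np.random.default_rng(0)
models = []

# Step 1 (dataset splitting): draw a bootstrap sample of the same size,
#   sampling indices with replacement.
# Step 2 (model training): fit one independent decision tree per sample.
for _ in range(n_models):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Step 3 (aggregation): majority vote across the trees for each test instance.
all_preds = np.array([m.predict(X_test) for m in models])           # shape (n_models, n_test)
majority_vote = np.array([np.bincount(col).argmax() for col in all_preds.T])
print("bagged ensemble accuracy:", np.mean(majority_vote == y_test))
```

In practice, scikit-learn's BaggingClassifier wraps these same three steps behind a single estimator, so the manual loop above is mainly useful for understanding what happens internally.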
Characteristics of Bagging
Bagging offers several notable benefits and characteristics that enhance the performance and robustness of predictive models:
Reduction in Variance: By aggregating predictions from multiple models trained on different subsets of the data, bagging effectively reduces the variance in the final model. This reduction occurs because the various models likely overfit different aspects of the data, and averaging their outputs cancels out the overfitting effects.
Effectiveness with High-Variance Models: Bagging is particularly useful when the base models have high variance, that is, when they are highly sensitive to the specific data they were trained on. By using bootstrap samples, bagging ensures that the ensemble learns diverse representations of the data, smoothing out this variability.
Mitigation of Overfitting: The ensemble nature of bagging helps to mitigate overfitting, a common problem in complex models like deep decision trees. Since each model in the ensemble is trained on a different subset of the data, the final aggregated model generalizes better to unseen data compared to any single model.
Noise Reduction: Bagging can reduce the impact of noise and outliers in the data. Outliers may not appear in every bootstrapped dataset, thus their influence is diminished in the final prediction. This helps in creating a more robust model that is less sensitive to anomalies in the training data.
Potential for Improvement with Weighted Voting: In cases where models output probabilities, applying a weighted voting scheme can slightly improve the performance of the ensemble. By giving more weight to more confident predictions, the ensemble can potentially make more accurate decisions, particularly in scenarios where some models are more reliable than others.
Bootstrap Aggregated Neural Networks
Bootstrap Aggregated Neural Networks (BANNs)[4] apply the principles of bagging specifically to neural networks. Just as bagging works with decision trees, BANNs involve training multiple neural networks on different bootstrapped subsets of the original training data. Each neural network is trained independently, potentially with different architectures or hyperparameters, to capture various aspects of the data distribution.
In BANNs, the process begins by generating several bootstrapped samples from the training dataset. Each sample is created by randomly selecting data points with replacement, ensuring that each dataset has the same size as the original. This method introduces diversity among the datasets, as each one may emphasize different features and patterns due to the inclusion and exclusion of specific data points.
Each neural network in the ensemble is then trained on one of these unique datasets. This approach helps mitigate overfitting, a common issue with neural networks, especially deep ones, due to their high capacity to model complex data. By training on different datasets, the networks learn different representations and decision boundaries, reducing the overall variance when their predictions are combined.
The final prediction in BANNs is made by aggregating the outputs of all the neural networks in the ensemble. This aggregation can be done through simple averaging in regression tasks or majority voting in classification tasks. The ensemble’s output is typically more accurate and robust than any single neural network's prediction, as the errors of individual models tend to cancel out.
BANNs are particularly useful in situations where the dataset is large and complex, and where neural networks are prone to overfitting due to their flexibility. By employing bagging, BANNs enhance the stability and generalizability of neural network models, making them a powerful tool in various machine learning applications.
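As a rough illustration of the idea, the sketch below bags a handful of small scikit-learn MLPs on a synthetic regression problem and averages their outputs. The architecture, dataset, and ensemble size are illustrative assumptions, not the exact setup of the cited paper.

```python
# A minimal sketch of bootstrap-aggregated neural networks for regression.
# The dataset, network size, and ensemble size are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.default_rng(1)
networks = []
for seed in range(10):
    # Each network is trained on a different bootstrap sample of the training data.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed)
    networks.append(net.fit(X_train[idx], y_train[idx]))

# The ensemble prediction is the simple average of the individual networks.
ensemble_pred = np.mean([net.predict(X_test) for net in networks], axis=0)
print("ensemble MSE:", np.mean((ensemble_pred - y_test) ** 2))
```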
Problem with Bagging and Decision Trees
While bagging is a powerful technique, it can encounter limitations when applied to decision trees, especially in datasets where a few attributes are particularly strong predictors.
Similarity of Trees: If the dataset contains a small number of attributes that are significantly stronger predictors than others, these attributes tend to dominate the decision-making process of the trees. As a result, all the trees in the ensemble may become very similar, despite the use of different bootstrapped datasets. This occurs because these strong predictors are consistently selected at the top splits (near the root) of the trees, leading to a lack of diversity in the ensemble.
Dominance of Strong Predictors: When building decision trees, the algorithm typically selects attributes that offer the highest information gain or Gini impurity reduction for the initial splits. If the same few attributes are always the best choices, they will repeatedly appear near the root of the trees across all the models. This reduces the overall benefit of bagging, as the ensemble lacks variability and thus may not significantly reduce variance or improve predictive performance compared to a single decision tree.
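A quick way to see this dominance numerically is to compute the Gini impurity decrease (one minus the sum of squared class proportions, weighted over the child nodes) for two hypothetical splits. The data below is made up purely for illustration.

```python
# A small illustration of why a strong predictor dominates: the Gini impurity
# decrease of a candidate binary split, the criterion a tree roughly uses when
# choosing its top split. The labels and splits below are made-up examples.
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(labels: np.ndarray, split_mask: np.ndarray) -> float:
    """Parent impurity minus the weighted impurity of the two child nodes."""
    left, right = labels[split_mask], labels[~split_mask]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

y = np.array([0] * 50 + [1] * 50)
# A strong predictor separates the classes almost perfectly, so its split wins
# near the root of almost every bootstrapped tree; a weak one barely helps.
strong_split = np.array([True] * 48 + [False] * 2 + [False] * 48 + [True] * 2)
weak_split = np.array([True, False] * 50)
print("strong predictor decrease:", gini_decrease(y, strong_split))
print("weak predictor decrease:  ", gini_decrease(y, weak_split))
```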
To address this issue, methods like Random Forests introduce additional randomness by selecting a random subset of attributes at each split, ensuring that not all trees rely on the same strong predictors and thereby increasing the diversity within the ensemble.
Random Forest
Random Forest[3] is a powerful ensemble learning method that enhances the predictive accuracy and stability of decision tree models [5]. It builds upon the bagging technique by incorporating randomness not only in the data sampling process but also in the feature selection process, which increases model diversity and reduces the risk of overfitting.
Key Components and Process:
Bootstrap Sampling: Like in traditional bagging, Random Forest begins by creating multiple bootstrapped samples from the original dataset. Each sample is generated by randomly selecting data points from the original set, with replacement, resulting in several different training sets. This process ensures that each tree in the forest is trained on a unique subset of data, thus incorporating different perspectives and potential biases from the dataset.
Random Feature Selection: A distinguishing feature of Random Forest is the random selection of features at each split in the decision tree. For every node in a tree, instead of considering all available features, the algorithm randomly selects a small subset of features. The best split is then determined only from this subset, rather than the entire set of features. This approach prevents the strongest predictors from dominating the model structure and forces the trees to consider different combinations of features, enhancing the overall diversity of the ensemble.
Tree Building and Aggregation: Each tree in the Random Forest is grown to its maximum depth without pruning. This means that the individual trees can potentially overfit their respective bootstrapped datasets. However, because each tree is based on different data and features, the ensemble as a whole benefits from a reduction in variance. The final prediction for a given instance is obtained by aggregating the predictions of all trees, typically using majority voting for classification tasks or averaging for regression tasks.
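Putting these three components together, the sketch below uses scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are illustrative assumptions.

```python
# A hedged sketch of the three Random Forest components with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, n_informative=5,
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees, each grown on its own bootstrap sample
    bootstrap=True,        # bootstrap sampling of the training data
    max_features="sqrt",   # random subset of features considered at each split
    max_depth=None,        # trees are grown deep, without pruning
    random_state=7,
)
forest.fit(X_train, y_train)

# Aggregation: predict() and score() internally combine the votes of all 200 trees.
print("test accuracy:", forest.score(X_test, y_test))
```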
Advantages of Random Forest:
Robustness and Accuracy: The method is particularly robust against overfitting, a common problem in single decision trees. This is due to the averaging of multiple trees, which smooths out the model's predictions and makes it less sensitive to noise in the data.
Handling High-Dimensional Data: Random Forest can effectively handle datasets with a large number of features. The random feature selection mechanism ensures that not all features are considered at each split, which helps in managing high-dimensional data and reduces the computational load.
Feature Importance Measurement: An added benefit of Random Forest is its ability to provide estimates of feature importance. By evaluating how much each feature contributes to the model's predictive power, practitioners can gain insight into the most significant factors influencing the outcomes, which is valuable for understanding the data and making decisions (see the short example after this list).
Versatility: Random Forest can be applied to both classification and regression problems. It can handle various data types and distributions, making it a versatile tool in the machine learning toolkit.
Resilience to Overfitting: The inherent randomness in both data and feature selection contributes to its resilience against overfitting, especially when the number of trees in the forest is large. By the law of large numbers, the forest's averaged prediction converges as more trees are added, so increasing the number of trees does not lead to overfitting.
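As a brief example of the feature-importance estimates mentioned above, the sketch below fits a forest to scikit-learn's built-in breast-cancer dataset (chosen purely for illustration) and prints the five most important features.

```python
# A short example of reading impurity-based feature importances from a fitted
# forest; the dataset and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ sums to 1.0; higher values indicate features that
# contributed more impurity reduction across the trees of the forest.
ranking = np.argsort(forest.feature_importances_)[::-1]
for i in ranking[:5]:
    print(f"{data.feature_names[i]:<25} {forest.feature_importances_[i]:.3f}")
```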
Challenges and Considerations:
While Random Forest is generally robust and effective, it can become computationally expensive with very large datasets and a large number of trees. The model's complexity and the number of trees can also lead to longer training times. However, these trade-offs are often outweighed by the benefits of increased accuracy and stability.
In summary, Random Forest is a highly effective ensemble learning method that enhances the accuracy of decision trees by introducing diversity through randomization in both data sampling and feature selection. Its ability to handle complex datasets and provide insights into feature importance makes it a widely used technique in various fields of machine learning and data analysis.
Conclusions on Ensembles and Bagging
Ensemble methods, particularly those using bagging, offer a powerful way to enhance the accuracy and robustness of predictive models. By combining multiple models trained on different subsets of the data, bagging reduces variance and mitigates overfitting. Techniques like Random Forest further enhance this approach by introducing random feature selection, ensuring greater model diversity and improved generalization. These methods are particularly effective in complex, high-dimensional datasets and provide useful insights into feature importance.
In our next article, we will delve into boosting, another ensemble method that strategically focuses on improving prediction accuracy by iteratively addressing the errors made by previous models.
References
[1] Surowiecki, J. (2004). The Wisdom of Crowds. Doubleday.
[2] Yang, X., Wang, Y., Byrne, R., Schneider, G., & Yang, S. (2019). Concepts of artificial intelligence for computer-assisted drug discovery. Chemical Reviews, 119(18), 10520-10594.
[3] Random forest
[4] Zhang, J. (1999). Developing robust non-linear models through bootstrap aggregated neural networks. Neurocomputing, 25(1-3), 93-113.
[6] Khan, M. Y., Qayoom, A., Nizami, M. S., Siddiqui, M. S., Wasi, S., & Raazi, S. M. K. U. R. (2021). Automated Prediction of Good Dictionary EXamples (GDEX): A Comprehensive Experiment with Distant Supervision, Machine Learning, and Word Embedding‐Based Deep Learning Techniques. Complexity, 2021(1), 2553199.