In October 2006, Netflix announced a $1 million prize for the team that could improve movie recommendations by 10% over Netflix’s own proprietary recommendation system, CineMatch. The Netflix Grand Prize was one of the first-ever open data science competitions and attracted tens of thousands of teams.
The training set consisted of 100 million ratings that 480,000 users had given to 17,000 movies. Within three weeks, 40 teams had already beaten CineMatch’s results. By September 2007, more than 40,000 teams had entered the contest, and a team from AT&T Labs took the 2007 Progress Prize by improving upon CineMatch by 8.42%.
As the competition progressed with the 10% mark remaining elusive, a curious phenomenon emerged among the competitors. Teams began to collaborate and share knowledge about effective feature engineering, algorithms, and techniques. Inevitably, they began combining their models, blending individual approaches into powerful and sophisticated ensembles of many models. These ensembles combined the best of various diverse models and features, and they proved to be far more effective than any individual model.
In June 2009, nearly three years after the contest began, BellKor’s Pragmatic Chaos, a merger of three different teams, edged out another merged team, The Ensemble (itself a merger of more than 30 teams!), to improve on the baseline by 10% and take the $1 million prize. “Edged out” may even be an understatement: BellKor’s Pragmatic Chaos managed to submit their final models barely 20 minutes before The Ensemble got theirs in (http://mng.bz/K08O). In the end, both teams achieved a final performance improvement of 10.06%.
While the Netflix competition captured the imagination of data scientists, machine learners, and casual data science enthusiasts worldwide, its lasting legacy has been to establish ensemble methods as a powerful way to build practical and robust models for large-scale, real-world applications. Among the individual algorithms used are several that have become staples of collaborative filtering and recommendation systems today: k-nearest neighbors, matrix factorization, and restricted Boltzmann machines. However, Andreas Töscher and Michael Jahrer of BigChaos, co-winners of the Netflix prize, summed up their keys to success:1
During the nearly 3 years of the Netflix competition, there were two main factors which improved the overall accuracy: the quality of the individual algorithms and the ensemble idea. . . . The ensemble idea was part of the competition from the beginning and evolved over time. In the beginning, we used different models with different parametrization and a linear blending. . . . [Eventually] the linear blend was replaced by a nonlinear one.
In the years since, the use of ensemble methods has exploded, and they have emerged as a state-of-the-art technology for machine learning.
The next two sections provide a gentle introduction to what ensemble methods are, why they work, and where they are applied. Then, we’ll look at a subtle but important challenge prevalent in all machine-learning algorithms: the fit versus complexity tradeoff.
Finally, we jump into training our very first ensemble method for a hands-on view of how ensemble methods overcome this fit versus complexity tradeoff and improve overall performance. Along the way, you’ll become familiar with several key terms that form the lexicon of ensemble methods and will be used throughout the book.

What exactly is an ensemble method? Let’s get an intuitive idea of ensemble methods and how they work by considering the allegorical case of Dr. Randy Forrest. We can then go on to frame the ensemble learning problem.
Dr. Randy Forrest is a famed and successful diagnostician, much like his idol Dr. Gregory House of TV fame. His success, however, is due not only to his exceeding politeness (unlike his cynical and curmudgeonly idol) but also his rather unusual approach to diagnosis.
You see, Dr. Forrest works at a teaching hospital and commands the respect of a large number of doctors-in-training. Dr. Forrest has taken care to assemble a team with a diversity of skills (this is pretty important, and we’ll see why shortly). His residents excel at different specializations: one is good at cardiology (heart), another at pulmonology (lungs), yet another at neurology (nervous system), and so on. All in all, the group is a rather diversely skillful bunch, each with their own strengths.
Every time Dr. Forrest gets a new case, he solicits the opinions of his residents and collects possible diagnoses from all of them (see figure 1.1). He then democratically selects the final diagnosis as the most common one from among all those proposed.
Figure 1.1 The diagnostic procedure followed by Dr. Randy Forrest every time he gets a new case is to ask all of his residents their opinions of the case. His residents offer their diagnoses: either the patient does or does not have cancer. Dr. Forrest then selects the majority answer as the final diagnosis put forth by his team.

Dr. Forrest embodies a diagnostic ensemble: he aggregates his residents’ diagnoses into a single diagnosis representative of the collective wisdom of his team. As it turns out, Dr. Forrest is right more often than any individual resident because he knows that his residents are pretty smart, and a large number of pretty smart residents are unlikely to all make the same mistake. Here, Dr. Forrest relies on the power of model aggregation, or model averaging: he knows that the average answer is most likely going to be a good one.
Still, how does Dr. Forrest know that all his residents aren’t wrong? He can’t know that for sure, of course. However, he has guarded against this undesirable outcome all the same. Remember that his residents all have diverse specializations. Because of their diverse backgrounds, training, specialization, and skills, it’s possible, but highly unlikely, that all his residents are wrong. Here, Dr. Forrest relies on the power of ensemble diversity, or the diversity of the individual components of his ensemble.
Dr. Randy Forrest, of course, is an ensemble method, and his residents (who are in training) are the machine-learning algorithms that make up the ensemble. The secrets to his success, and indeed the success of ensemble methods as well, are
- Ensemble diversity—He has a variety of opinions to choose from.
- Model aggregation—He can combine those opinions into a single final opinion.
Any collection of machine-learning algorithms can be used to build an ensemble, which is, literally, a group of machine learners. But why do they work? James Surowiecki, in The Wisdom of Crowds, describes human ensembles or wise crowds thus:
If you ask a large enough group of diverse and independent people to make a prediction or estimate a probability, the average of those answers will cancel out errors in individual estimation. Each person’s guess, you might say, has two components: information and errors. Subtract the errors, and you’re left with the information.
This is also precisely the intuition behind ensembles of learners: it’s possible to build a wise machine-learning ensemble by aggregating individual learners.
The key to success with ensemble methods is ensemble diversity, also known by alternate terms such as model complementarity or model orthogonality. Informally, ensemble diversity refers to the fact that individual ensemble components, or machine-learning models, are different from each other. Training such ensembles of diverse individual models is a key challenge in ensemble learning, and different ensemble methods achieve this in different ways.

What can you do with ensemble methods? Are they really just hype, or are they hallelujah? As we see in this section, they can be used to train and deploy robust and effective predictive models for many different applications.
One palpable success of ensemble methods is their domination of data science competitions (alongside deep learning), where they have been generally successful on different types of machine-learning tasks and application areas.
Anthony Goldbloom, CEO of Kaggle, revealed in 2015 that the three most successful algorithms for structured problems were XGBoost, random forest, and gradient boosting, all ensemble methods. Indeed, the most popular way to tackle data science competitions these days is to combine feature engineering with ensemble methods. Structured data is generally organized in tables, relational databases, and other formats most of us are familiar with, and ensemble methods have proven to be very successful on this type of data.
Unstructured data, in contrast, doesn’t always have a tabular structure. Images, audio, video, waveform, and text data are typically unstructured, and deep learning approaches—including automated feature generation—have been very successful on these types of data. While we focus on structured data for most of this book, ensemble methods can be combined with deep learning for unstructured problems as well.
Beyond competitions, ensemble methods drive data science in several areas, including financial and business analytics, medicine and health care, cybersecurity, education, manufacturing, recommendation systems, entertainment, and many more.
In 2018, Olson et al.2 conducted a comprehensive analysis of 14 popular machine-learning algorithms and their variants. They ranked each algorithm’s performance on 165 classification benchmark data sets. Their goal was to emulate the standard machine-learning pipeline to provide advice on how to select a machine-learning algorithm.
These comprehensive results are compiled into figure 1.2. Each row shows how often the model in that row outperforms the models in the other columns across all 165 data sets. For example, XGBoost beats gradient boosting on 34 of 165 benchmark data sets (first row, second column), while gradient boosting beats XGBoost on 12 of 165 benchmark data sets (second row, first column). On the remaining 119 data sets, the two models perform essentially equally well.
Figure 1.2 Which machine-learning algorithm should I use for my data set? The performance of several different machine-learning algorithms, relative to each other on 165 benchmark data sets, is shown here. The final trained models are ranked (top-to-bottom, left-to-right) based on their performance on all benchmark data sets in relation to all other methods. In their evaluation, Olson et al. consider two methods to have the same performance on a data set if their prediction accuracies are within 1% of each other. This figure was reproduced using the codebase and comprehensive experimental results compiled by the authors into a publicly available GitHub repository (https://github.com/rhiever/sklearn-benchmarks) and includes the authors’ evaluation of XGBoost as well.

In contrast, XGBoost beats multinomial naïve Bayes (MNB) on 157 of 165 data sets (first row, last column), while MNB only beats XGBoost on 2 of 165 data sets (last row, first column) and can only match XGBoost on 6 of 165 data sets!
In general, ensemble methods (1: XGBoost, 2: gradient boosting, 3: Extra Trees, 4: random forests, 8: AdaBoost) outperformed other methods handily. These results demonstrate exactly why ensemble methods (specifically, tree-based ensembles) are considered state of the art.
If your goal is to develop state-of-the-art analytics from your data, or to eke out better performance and improve models you already have, this book is for you. If your goal is to start competing more effectively in data science competitions for fame and fortune or to just improve your data science skills, this book is also for you. If you’re excited about adding powerful ensemble methods to your machine-learning arsenal, this book is definitely for you.
To drive home this point, we’ll build our first ensemble method: a simple model combination ensemble. Before we do, let’s dive into the tradeoff between fit and complexity that most machine-learning methods have to grapple with, as it will help us understand why ensemble methods are so effective.

In this section, we look at two popular machine-learning methods: decision trees and support vector machines (SVMs). As we do so, we’ll explore how their fitting and predictive behavior changes as they learn increasingly complex models. This section also serves as a refresher of the training and evaluation practices we usually follow during modeling. Recall that machine-learning tasks broadly fall into two types:
- Supervised learning tasks—These have a data set of labeled examples, where data has been annotated. For example, in cancer diagnoses, each example will be an individual patient, with label/annotation “has cancer” or “does not have cancer.” Labels can be 0-1 (binary classification), categorical (multiclass classification), or continuous (regression).
- Unsupervised learning tasks—These have a data set of unlabeled examples, where the data lacks annotations. This includes tasks such as grouping examples together by some notion of “similarity” (clustering) or identifying anomalous data that doesn’t fit the expected pattern (anomaly detection).
We’ll create a simple, synthetically generated, supervised regression data set to illustrate the key challenge in training machine-learning models and to motivate the need for ensemble methods. With this data set, we’ll train increasingly complex machine-learning models that fit and eventually overfit the data during training. As we’ll see, fitting the training data better doesn’t necessarily produce models that generalize better.
One of the most popular machine-learning models is the decision tree,3 which can be used for classification as well as regression tasks. A decision tree is made up of decision nodes and leaf nodes, and each decision node tests the current example for a specific condition.
For example, in figure 1.3, we use a decision-tree classifier for a binary classification task over a data set with two features, x1 and x2. The first node tests each input example to see if the second feature x2 > 5 and then funnels the example to the right or left branch of the decision tree depending on the result. This continues until the input example reaches a leaf node; at this point, the prediction corresponding to the leaf node is returned. For classification tasks, the leaf value is a class label, whereas for regression tasks, the leaf returns a regression value.
Figure 1.3 Decision trees partition the feature space into axis-parallel rectangles. When used for classification, the tree checks for conditions on the features in the decision nodes, funneling the example to the left or right after each test. Ultimately, the example filters down to a leaf node, which will give its classification label. The partition of the feature space according to this decision tree is shown on the left.

A decision tree of depth 1 is called a decision stump and is the simplest possible tree. A decision stump contains a single decision node and two leaf nodes. A shallow decision tree (say, depth 2 or 3) will have a small number of decision nodes and leaf nodes and is a simple model. Consequently, it can only represent simple functions.
On the other hand, a deeper decision tree is a more complex model with many more decision nodes and leaf nodes. A deeper decision tree, thus, can represent richer and more complex functions.
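To make the link between depth and complexity concrete, here’s a minimal sketch (on a small synthetic data set of our own, not the book’s running example) comparing a decision stump to a deeper tree:

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# toy regression data, just for this illustration
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, y_toy)   # decision stump: a single decision node
deep = DecisionTreeRegressor(max_depth=8).fit(X_toy, y_toy)    # deeper, more complex tree

print(stump.get_depth(), stump.get_n_leaves())   # depth 1, 2 leaves
print(deep.get_depth(), deep.get_n_leaves())     # up to depth 8, many more leaves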
We’ll explore such tradeoffs between model fit and representation complexity in the context of a synthetic data set called Friedman-1, originally created by Jerome Friedman in 1991 to explore how well his new multivariate adaptive regression splines (MARS) algorithm was fitting high-dimensional data.
This data set was carefully generated to evaluate a regression method’s ability to pick up only the true feature dependencies in the data and ignore the rest. More specifically, it contains 15 randomly generated features, of which only the first 5 are relevant to the target variable:
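As documented for scikit-learn’s make_friedman1 generator, the target depends only on the first five features:

    y = 10 · sin(π · x1 · x2) + 20 · (x3 − 0.5)² + 10 · x4 + 5 · x5 + noise,

where the noise is Gaussian with standard deviation set by the noise argument; features x6 through x15 have no influence on y.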

scikit-learn provides a built-in function that we can use to generate as much data from this scheme as we need:
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=500,    # number of examples to generate
                      n_features=15,    # total number of features; only the first 5 affect the target
                      noise=0.3,        # standard deviation of the additive Gaussian noise
                      random_state=23)
We’ll randomly split the data set into a training set (with 67% of the data) and a test set (with 33% of the data) in order to illustrate the effects of the fit versus complexity tradeoff more clearly.
TIP
During modeling, we often have to split the data into a training and a test set. How big should these sets be? If the fraction of the data that makes up the training set is too small, the model won’t have enough data to train on. If the fraction of the data that makes up the test set is too small, there will be higher variation in our generalization estimates of how well the model performs on future data. A good rule of thumb for medium to large data sets, inspired by the Pareto principle, is to start with an 80%-20% train-test split. A good alternative for small data sets is the leave-one-out approach, where a single example is left out each time for evaluation, and the overall training and evaluation process is repeated for every example.
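As a quick, hedged illustration (not part of the book’s running example), scikit-learn supports both strategies directly:

from sklearn.model_selection import train_test_split, LeaveOneOut

# 80%-20% split, following the rule of thumb above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=23)

# Leave-one-out: one held-out example per round, one round per example
loo = LeaveOneOut()
print(loo.get_n_splits(X))   # equals the number of examples in X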
For different depths d = 1 to 10, we train a tree on the training set and evaluate it on the test set. When we look at the training errors and the test errors across different depths, we can identify the depth of the “best tree.” We characterize “best” in terms of an evaluation metric. For regression problems, there are several evaluation metrics: mean squared error (MSE), mean absolute deviation (MAD), coefficient of determination, and so on.
We’ll use the coefficient of determination, also known as the R2 score, which measures the proportion of the variance in the labels (y) that is predictable from the features (x).
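For reference, the R2 score computed by scikit-learn’s r2_score is

    R2 = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²,

where ŷi are the model’s predictions and ȳ is the mean of the true labels. A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean label (negative scores are possible for models that do even worse).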
One last thing to note is that we are splitting the data into a training set and test set randomly, which means that it’s possible to get very lucky or very unlucky in our split. To avoid the influence of randomness, we repeat our experiment K = 5 times and average the results across the runs. Why 5? This choice is often somewhat arbitrary, and you’ll have to decide whether you want less variation in the test errors (large values of K) or less computation time (small values of K).
In pseudocode, the overall experimental procedure looks like this:

for run = 1:5
    (Xtrn, ytrn), (Xtst, ytst) = split data (X), labels (y) into training & test subsets randomly
    for depth d = 1:10
        tree[d] = train decision tree of depth d on the training subset (Xtrn, ytrn)
        train_scores[run, d] = compute R2 score of tree[d] on the training set (Xtrn, ytrn)
        test_scores[run, d] = compute R2 score of tree[d] on the test set (Xtst, ytst)
mean_train_score = average train_scores across runs
mean_test_score = average test_scores across runs
The following code snippet does precisely this, and then it plots the training and test scores. Rather than explicitly implement the preceding pseudocode, the following code uses the scikit-learn function sklearn.model_selection.ShuffleSplit to automatically split the data into five different training and test subsets, and it uses sklearn.model_selection.validation_curve to determine R2 scores for varying decision tree depths:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import validation_curve

subsets = ShuffleSplit(n_splits=5, test_size=0.33,       # five random 67%-33% train-test splits
                       random_state=23)
model = DecisionTreeRegressor()
trn_scores, tst_scores = validation_curve(model, X, y,   # R2 scores for tree depths 1 through 10
                                          param_name='max_depth',
                                          param_range=range(1, 11),
                                          cv=subsets, scoring='r2')
mean_train_score = np.mean(trn_scores, axis=1)
mean_test_score = np.mean(tst_scores, axis=1)
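The plotting step itself isn’t shown in the snippet above; a minimal matplotlib sketch (the labels and styling are my own choices) that produces a figure along the lines of figure 1.4 is:

import matplotlib.pyplot as plt

depths = list(range(1, 11))
plt.plot(depths, mean_train_score, marker='o', label='Training score (R2)')
plt.plot(depths, mean_test_score, marker='s', label='Test score (R2)')
plt.xlabel('Decision tree depth (max_depth)')
plt.ylabel('Mean R2 score over 5 splits')
plt.legend()
plt.show()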
Remember, our ultimate goal is to build a machine-learning model that generalizes well, that is, a model that performs well on future, unseen data. Our first instinct, then, will be to train a model that achieves the smallest training error. Such models will typically be quite complex, since a complex model can fit more of the training examples and thus achieve a small training error. It’s natural to presume that the model with the smallest training error will also generalize well and predict unseen examples equally well.
Now, let’s look at the training and test scores in figure 1.4 to see if this is the case. Remember that an R2 score close to 1 indicates a very good regression model, and scores further away from 1 indicate worse models.
Deeper decision trees are more complex and have greater representational power, so it’s not surprising to see that deeper trees fit the training data better. This is clear from figure 1.4: as tree depth (model complexity) increases, the training score approaches R2 = 1. Thus, more complex models achieve better fits on the training data.
Figure 1.4 Comparing decision trees of different depths on the Friedman-1 regression data set using R2 as the evaluation metric. Higher R2 scores mean that the model achieves lower error and fits the data better. An R2 score close to 1 means that the model achieves nearly zero error. It’s possible to fit the training data nearly perfectly with very deep decision trees, but such overly complex models actually overfit the training data and don’t generalize well to future data, as evidenced by the test scores.

What is surprising, however, is that the test R2 score doesn’t similarly keep increasing with complexity. In fact, beyond max_depth=4, test scores remain fairly consistent. This suggests that a tree of depth 8 might fit the training data better than a tree of depth 4, but both trees will perform roughly identically when they try to generalize and predict on new data!
As decision trees become deeper, they get more complex and achieve lower training errors. However, their ability to generalize to future data (estimated by the test scores) doesn’t keep improving. This is a rather counterintuitive result: the model with the best fit on the training set isn’t necessarily the best model for predictions when deployed in the real world.
It’s tempting to argue that we got unlucky when we partitioned the training and test sets randomly. However, we ran our experiment with five different random partitions and averaged the results to avoid this. To be sure, however, let’s repeat this experiment with another well-known machine-learning method: support vector regression.4
Like decision trees, support vector machines (SVMs) are a great off-the-shelf baseline modeling approach, and most packages come with a robust implementation of SVMs. You may have used SVMs for classification, where it’s possible to learn nonlinear models of considerable complexity using kernels such as the radial basis function (RBF) kernel or the polynomial kernel. SVMs have also been adapted for regression, and as in the classification case, they try to find a model that trades off between regularization and fit during training. Specifically, SVM training tries to find a model that minimizes a two-part objective:
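Schematically (the exact notation varies by reference, so treat this as a sketch of the idea rather than the book’s exact equation), the objective has the form

    regularization(f) + C · Σi loss(yi, f(xi)),

summed over the training examples (xi, yi).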

The regularization term measures the flatness of the model: the more it is minimized, the more linear and less complex the learned model is. The loss term measures the fit to the training data through a loss function (typically, MSE): the more it is minimized, the better the fit to the training data. The regularization parameter C trades off between these two competing objectives:
- A small value of C means the model will focus more on regularization and simplicity, and less on training error, which causes the model to have higher training error and underfit.
- A large value of C means the model will focus more on training error and learn more complex models, which causes the model to have lower training errors and possibly overfit.
We can see the effect of increasing the value of C on the learned models in figure 1.5. In particular, we can visualize the tradeoff between fit and complexity.
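A hedged sketch of the behavior shown in figure 1.5, on a one-dimensional toy problem of our own (not the book’s exact setup), is the following; the training R2 climbs steadily as C grows, signaling a tighter and tighter fit:

import numpy as np
from sklearn.svm import SVR

# noisy 1-D sine curve, just for illustration
rng = np.random.RandomState(0)
X_1d = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y_1d = np.sin(X_1d).ravel() + 0.2 * rng.randn(60)

for C in [0.01, 1.0, 100.0]:
    svr = SVR(kernel='rbf', gamma=0.75, C=C).fit(X_1d, y_1d)
    print(C, svr.score(X_1d, y_1d))   # training R2 increases with C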
Figure 1.5 Support vector machine with an RBF kernel, with kernel parameter gamma = 0.75. Small values of C result in more linear (flatter) and less complex models that underfit the data, while large values of C result in more nonlinear (curvier) and more complex models that overfit the data. Selecting a good value for C is critically important in training a good SVM model.

Much like max_depth in DecisionTreeRegressor(), the parameter C in support vector regression, SVR(), can be tuned to obtain models with different behaviors. Again, we’re faced with the same question: which is the best model? To answer this, we can repeat the same experiment as with decision trees:
from sklearn.svm import SVR

model = SVR(kernel='rbf', gamma=0.1)
trn_scores, tst_scores = validation_curve(model, X, y.ravel(),
                                          param_name='C',                      # regularization parameter to vary
                                          param_range=np.logspace(-2, 4, 7),   # seven values of C, from 0.01 to 10,000
                                          cv=subsets, scoring='r2')
mean_train_score = np.mean(trn_scores, axis=1)
mean_test_score = np.mean(tst_scores, axis=1)
In this code snippet, we train an SVM regressor with an RBF kernel (gamma=0.1). We try seven values of C (0.01, 0.1, 1, 10, 100, 1,000, and 10,000) and visualize the train and test scores, as before, in figure 1.6.
Figure 1.6 Comparing SVM regressors of different complexities on the Friedman-1 regression data set using R2 as the evaluation metric. As with decision trees, highly complex models (corresponding to higher C values) appear to achieve fantastic fit on the training data, but they don’t actually generalize as well. This means that as C increases, so does the possibility of overfitting.

Again, rather counterintuitively, the model with the best fit on the training set isn’t necessarily the best model for predictions when deployed in the real world. Every machine-learning algorithm, in fact, exhibits this behavior:
- Overly simple models tend to not fit the training data properly and tend to generalize poorly on future data; a model that is performing poorly on training and test data is underfitting.
- Overly complex models can achieve very low training errors but tend to generalize poorly on future data too; a model that is performing very well on training data, but poorly on test data is overfitting.
- The best models trade off between complexity and fit, sacrificing a little bit of each during training so that they can generalize most effectively when deployed.
As we’ll see in the next section, ensemble methods are an effective way of tackling the problem of fit versus complexity.

In this section, we’ll overcome the fit versus complexity problems of individual models by training our first ensemble. Recall from the allegorical Dr. Forrest that an effective ensemble performs model aggregation on a set of component models, as follows:
- We train a set of base estimators (also known as base learners) using diverse base-learning algorithms on the same data set. That is, we count on the significant variations among the learning algorithms to produce a diverse set of base estimators.
- For a regression problem (e.g., the Friedman-1 data introduced in the previous section), the predictions of individual base estimators are continuous. We can aggregate the results into one final ensemble prediction by simple averaging of the individual predictions.
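In symbols, if the m base estimators produce predictions ŷ1(x), ..., ŷm(x) for an input x, simple averaging gives the ensemble prediction

    ŷ_ens(x) = (1/m) · (ŷ1(x) + ... + ŷm(x)).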
We use the following regression algorithms to produce base estimators from our data set: kernel ridge regression, support vector regression, decision-tree regression, k-nearest neighbor regression, Gaussian processes, and multilayer perceptrons (neural networks).
Once we have the trained models, we use each one to make individual predictions and then aggregate the individual predictions into a final prediction, as shown in figure 1.7.
Figure 1.7 Our first ensemble method ensembles the predictions of six different regression models by averaging them. This simple ensemble illustrates two key principles of ensembling: (1) model diversity, achieved in this case by using six different base machine-learning models; and (2) model aggregation, achieved in this case by simple averaging across predictions.

Listing 1.1 Training diverse base estimators
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=500, n_features=15,
                      noise=0.3, random_state=23)           # generate the Friedman-1 data
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y,
                                          test_size=0.25)   # hold out 25% of the data for testing

from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor

estimators = {'krr': KernelRidge(kernel='rbf', gamma=0.25),  # six diverse base-learning algorithms
              'svr': SVR(gamma=0.5),
              'dtr': DecisionTreeRegressor(max_depth=3),
              'knn': KNeighborsRegressor(n_neighbors=4),
              'gpr': GaussianProcessRegressor(alpha=0.1),
              'mlp': MLPRegressor(alpha=25, max_iter=10000)}

for name, estimator in estimators.items():
    estimator = estimator.fit(Xtrn, ytrn)                    # train each base estimator on the same data
We have now trained six diverse base estimators using six different base-learning algorithms. Given new data, we can aggregate the individual predictions into a final prediction as shown in the following listing.
Listing 1.2 Aggregating base estimator predictions
import numpy as np

n_estimators, n_samples = len(estimators), Xtst.shape[0]
y_individual = np.zeros((n_samples, n_estimators))         # one column of predictions per base estimator

for i, (model, estimator) in enumerate(estimators.items()):
    y_individual[:, i] = estimator.predict(Xtst)            # collect each estimator's test-set predictions

y_final = np.mean(y_individual, axis=1)                     # aggregate by simple averaging
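As a quick usage check (not part of the original listing), we can compare the averaged prediction against each base estimator on the held-out test set:

from sklearn.metrics import r2_score

for name, estimator in estimators.items():
    print(name, r2_score(ytst, estimator.predict(Xtst)))   # individual test R2 scores
print('ensemble', r2_score(ytst, y_final))                 # the averaged ensemble's test R2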
One way to understand the benefits of ensembling is to look at the test performance of all possible combinations of models. That is, we look at the performance of one model at a time, then all possible ensembles of two models (there are 15 such combinations), then all possible ensembles of three models (there are 20 such combinations), and so on. For ensemble sizes 1 to 6, we plot the test set performances of all these ensemble combinations in figure 1.8.
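A hedged sketch of how such a combination analysis could be carried out (my own implementation of the idea, not the book’s plotting code) averages the columns of y_individual for every subset of estimators:

from itertools import combinations
from sklearn.metrics import r2_score

for k in range(1, n_estimators + 1):
    scores = []
    for subset in combinations(range(n_estimators), k):
        y_pred = np.mean(y_individual[:, list(subset)], axis=1)   # average the chosen models' predictions
        scores.append(r2_score(ytst, y_pred))
    # typically, the mean score rises and the spread shrinks with ensemble size (cf. figures 1.8 and 1.9)
    print(k, np.mean(scores), np.std(scores))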
Figure 1.8 Prediction performance versus ensemble size. When the ensemble size is 1, we can see that the performances of individual models are rather diverse. When the size is 2, we average the results of different pairs of models (in this case, 15 ensembles). When 3, we average the results of 3 models at a time (in this case, 20 ensembles), and so on, until the size is 6, when we average the results of all 6 models into a single, grand ensemble.

As we aggregate more and more models, we see that the ensembles generalize increasingly better. The most striking result of our experiment, though, is that the performance of the ensemble of all six estimators is often better than the performances of each individual estimator.
Finally, what of fit versus complexity? It’s difficult to characterize the complexity of the ensemble, as different types of estimators in our ensemble have different complexities. However, we can characterize the variance of the ensemble.
Recall that the variance of an estimator reflects its sensitivity to the data. A high-variance estimator is highly sensitive and less robust, often because it’s overfitting. In figure 1.9, we show the variance of the ensemble combinations from figure 1.8 as the width of the band around the mean performance.
Figure 1.9 The mean performance of the ensemble combinations increases, showing that bigger ensembles perform better. The standard deviation (square root of the variance) of the performance of ensemble combinations decreases, showing that the overall variance decreases!

As ensemble size increases, the variance of the ensemble decreases! This is a consequence of model aggregation or averaging. We know that averaging “smooths out the rough edges.” In the case of our ensemble, averaging individual predictions smooths out mistakes made by individual base estimators, replacing them instead with the wisdom of the ensemble: from many, one. The overall ensemble is more robust to mistakes and, unsurprisingly, generalizes better than any single base estimator.
Each component estimator in the ensemble is an individual, like one of Dr. Forrest’s residents, and each makes predictions based on its own experiences (introduced during learning). At prediction time, when we have six individuals, we’ll have six predictions, or six opinions. For “easy examples,” the individuals will mostly agree. For “difficult examples,” the individuals will differ among each other but, on average, are more likely to be closer to the correct answer.5
In this simple scenario, we trained six “diverse” models by using six different learning algorithms. Ensemble diversity is critical to the success of the ensemble as it ensures that the individual estimators are different from each other and don’t all make the same mistakes.
As we’ll see over and over again in each chapter, different ensemble methods take different approaches to train diverse ensembles. Before we end this chapter, let’s take a look at a broad classification of various ensembling techniques, many of which will be covered in the next few chapters.

All ensembles are composed of individual machine-learning models called base models, base learners, or base estimators (these terms are used interchangeably throughout the book) and are trained using base machine-learning algorithms. Base models are often described in terms of their complexity. Base models that are sufficiently complex (e.g., a deep decision tree) and have “good” prediction performance (e.g., accuracy over 80% for a binary classification task) are typically known as strong learners or strong models.
In contrast, base models that are pretty simple (e.g., a shallow decision tree) and achieve barely acceptable performance (e.g., accuracy around 51% for a binary classification task) are known as weak learners or weak models. More formally, a weak learner only has to do slightly better than random chance, or 50% for a binary classification task. As we’ll see shortly, ensemble methods use either weak learners or strong learners as base models.
More broadly, ensemble methods can be classified into two types depending on how they are trained: parallel and sequential ensembles. This is the taxonomy we’ll adopt in this book as it gives us a neat way of grouping the vast number of ensemble methods out there (see figure 1.10).
Parallel ensemble methods, as the name suggests, train each component base model independently of the others, which means that they can be trained in parallel. Parallel ensembles are often constructed out of strong learners and can further be categorized into the following:
- Homogeneous parallel ensembles—All the base learners are of the same type (e.g., all decision trees) and trained using the same base-learning algorithm. Several well-known ensemble methods, such as bagging, random forests, and extremely randomized trees (Extra Trees), are parallel ensemble methods. These are covered in chapter 2.
- Heterogeneous parallel ensembles—The base learners are trained using different base-learning algorithms. Meta-learning by stacking is a well-known exemplar of this type of ensembling technique. These are covered in chapter 3.
Sequential ensemble methods, unlike parallel ensemble methods, exploit the dependence of base learners. More specifically, during training, sequential ensembles train a new base learner in such a manner that it minimizes mistakes made by the base learner trained in the previous step. These methods construct ensembles sequentially in stages and often use weak learners as base models. They can also be further categorized into the following:
- Adaptive boosting ensembles—Also called vanilla boosting, these ensembles train successive base learners by reweighting examples adaptively to fix mistakes in previous iterations. AdaBoost, the granddaddy of all the boosting methods, is an example of this type of ensemble method. These are covered in chapter 4.
- Gradient-boosting ensembles—These ensembles extend and generalize the idea of adaptive boosting and aim to mimic gradient descent, which is often used under the hood to actually train machine-learning models. Some of the most powerful modern ensemble learning packages implement some form of gradient boosting (LightGBM, chapter 5), Newton boosting (XGBoost, chapter 6), or ordered boosting (CatBoost, chapter 8).
To summarize this chapter’s key takeaways:

- Ensemble learning aims to improve predictive performance by training multiple models and combining them into a meta-estimator. The component models of an ensemble are called base estimators or base learners.
- Ensemble methods use the power of “the wisdom of crowds,” which relies on the principle that the collective opinion of a group is more effective than any single individual in the group.
- Ensemble methods are widely used in several application areas, including financial and business analytics, medicine and health care, cybersecurity, education, manufacturing, recommendation systems, entertainment, and many more.
- Most machine-learning algorithms contend with a fit versus complexity (also called bias-variance) tradeoff, which affects their ability to generalize well to future data. Ensemble methods use multiple component models to overcome this tradeoff.
- An effective ensemble requires two key ingredients: (1) ensemble diversity and (2) model aggregation for the final predictions.
1. Andreas Töscher, Michael Jahrer, and Robert M. Bell, “The BigChaos Solution to the Netflix Grand Prize,” (http://mng.bz/9V4r).
2. Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore, “Data-Driven Advice for Applying Machine Learning to Bioinformatics Problems,” Pacific Symposium on Biocomputing (2018); arXiv preprint: https://arxiv.org/abs/1708.05070.
3. For more details about learning with decision trees, see chapters 3 (classification) and 9 (regression) of Machine Learning in Action by Peter Harrington (Manning, 2012).
4. For more details on SVMs for classification, see chapter 6 of Machine Learning in Action by Peter Harrington (Manning, 2012). For SVMs for regression, see “A Tutorial on Support Vector Regression” by Alex J. Smola and Bernhard Schölkopf (Statistics and Computing, 2004), as well as the documentation pages of sklearn.svm.SVR().
5. There are cases when this breaks down. In the UK version of Who Wants To Be A Millionaire?, a contestant successfully made it as far as £125,000 (or about $160,000), when he was asked which novel begins with the words: “3 May. Bistritz. Left Munich at 8:35 PM.” After using the 50/50 lifeline, he was left with only two choices: Tinker Tailor Soldier Spy and Dracula. Knowing he could lose £93,000 if he got it wrong, he asked the studio audience. In response, 81% of the audience voted for Tinker Tailor Soldier Spy. The audience was overwhelmingly confident and—unfortunately for the contestant—overwhelmingly wrong. As you’ll see in the book, we look to avoid this situation by making certain assumptions about the “audience,” which, in our case, is the base estimators.