2 Homogeneous Parallel Ensembles: Bagging and Random Forests
This chapter covers
- Training homogeneous parallel ensembles
- Implementing and understanding how bagging works
- Implementing and understanding how random forests work
- Training variants with pasting, random subspaces, random patches, and ExtraTrees
- Using bagging and random forests in practice
In Chapter 1, we introduced ensemble learning and created our first rudimentary ensemble. To recap, an ensemble method relies on the notion of “wisdom of the crowd”: the combined answer of many diverse models is often better than any one individual answer.
We begin our journey into ensemble learning in earnest with parallel ensemble methods. They are a natural starting point because, conceptually, they are easy to understand and implement.
Parallel ensemble methods, as the name suggests, train each component base estimator independently of the others, which means the base estimators can be trained in parallel. As we will see, parallel ensemble methods can be further distinguished as homogeneous and heterogeneous parallel ensembles, depending on the kind of learning algorithms they use.
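To make this independence concrete, here is a minimal sketch that trains a handful of decision trees, each on its own bootstrap sample, and fits them concurrently with joblib. The dataset, the seeds, and the `fit_one` helper are illustrative assumptions for this sketch, not an API defined by any particular library.

```python
# A minimal sketch of parallel ensemble training, assuming scikit-learn
# and joblib are available. Each base estimator is fit on its own
# bootstrap sample, with no dependence on the other estimators.
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

def fit_one(X, y, seed):
    # Draw a bootstrap sample (sampling with replacement)
    # and fit one base estimator on it
    idx = np.random.default_rng(seed).integers(0, len(X), size=len(X))
    return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])

# No estimator depends on any other, so all 10 fits can run concurrently
estimators = Parallel(n_jobs=-1)(
    delayed(fit_one)(X, y, seed) for seed in range(10))
```

Because each fit touches only its own bootstrap sample, adding more base estimators costs little extra wall-clock time on a multicore machine.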
In this chapter, we will learn about homogeneous parallel ensembles, whose component models are all trained using the same machine-learning algorithm. This is in contrast to heterogeneous parallel ensembles (covered in the next chapter), whose component models are trained using different machine-learning algorithms.
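As a rough illustration of this distinction, the snippet below constructs one ensemble of each kind using off-the-shelf scikit-learn classes; the particular base estimators and parameter values are arbitrary choices for the example.

```python
# Contrasting homogeneous and heterogeneous parallel ensembles
# with scikit-learn's built-in ensemble classes.
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Homogeneous: every component model is trained with the same
# learning algorithm (here, 10 decision trees)
homogeneous = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10)

# Heterogeneous: the component models come from different algorithms
heterogeneous = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier()),
    ("logreg", LogisticRegression()),
    ("nb", GaussianNB()),
])
```

Both objects expose the same fit/predict interface; the difference lies entirely in how the component models are produced and combined.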