2 Homogeneous parallel ensembles: Bagging and random forests

 

This chapter covers

  • Training homogeneous parallel ensembles
  • Implementing and understanding bagging
  • Implementing and understanding random forests
  • Training variants with pasting, random subspaces, random patches, and Extra Trees
  • Using bagging and random forests in practice

In chapter 1, we introduced ensemble learning and built our first rudimentary ensemble. To recap, an ensemble method relies on the notion of the “wisdom of the crowd”: the combined answer of many models is often better than any single model’s answer. In this chapter, we begin our journey into ensemble learning in earnest with parallel ensemble methods, which are conceptually the easiest ensembles to understand and implement.

Parallel ensemble methods, as the name suggests, train each component base estimator independently of the others, which means they can be trained in parallel. Parallel ensembles can be further distinguished as homogeneous or heterogeneous, depending on whether every base estimator is trained with the same learning algorithm or with different ones. This chapter focuses on homogeneous parallel ensembles, the best-known of which are bagging and random forests.
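
To make the idea concrete, here is a minimal sketch of a parallel ensemble. The data set, the number of trees, and the other settings are illustrative assumptions, not the chapter's own listing: five decision trees are fit independently, each on a different random half of the training data, and their answers are combined by majority vote. Because no tree depends on any other, the five fits could just as well run in parallel.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy two-class data set (an illustrative choice)
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=42)

# Train each base estimator independently of the others
rng = np.random.default_rng(42)
trees = []
for _ in range(5):
    idx = rng.choice(len(X_trn), size=len(X_trn) // 2, replace=False)  # a random half of the data
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X_trn[idx], y_trn[idx]))

# Combine the individual answers by majority vote
preds = np.array([tree.predict(X_tst) for tree in trees])              # shape (5, n_test)
majority = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, preds)
print("ensemble accuracy:", np.mean(majority == y_tst))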

2.1 Parallel ensembles

 
 
 
 

2.2 Bagging: Bootstrap aggregating

 

2.2.1 Intuition: Resampling and model aggregation
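
The core of bagging is bootstrap resampling: drawing a sample of the same size as the original training set, with replacement. The following is a minimal sketch of a single bootstrap sample on a toy array of our own choosing; on average, only about 63.2% of the original examples appear in any one bootstrap sample, and the remainder are "out-of-bag" for that replicate.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                     # toy data set (illustrative)

# A bootstrap sample: draw n row indices *with* replacement
boot_idx = rng.choice(n, size=n, replace=True)
X_boot = X[boot_idx]

# Roughly 63.2% of the original rows appear at least once;
# the rest are "out-of-bag" for this bootstrap sample
frac_in_sample = len(np.unique(boot_idx)) / n
print(f"fraction of original examples drawn: {frac_in_sample:.3f}")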

 
 
 
 

2.2.2 Implementing bagging
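
A bare-bones bagging classifier can be written in a few lines: fit each decision tree on its own bootstrap sample, then aggregate test-time predictions by majority vote. The function names and defaults below are illustrative sketches, not the chapter's own listing.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagging(X, y, n_estimators=100, random_state=0):
    # Fit n_estimators unpruned decision trees, each on its own bootstrap sample
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(random_state)
    ensemble = []
    for _ in range(n_estimators):
        idx = rng.choice(len(X), size=len(X), replace=True)   # sample with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def predict_bagging(ensemble, X):
    # Aggregate the individual trees' predictions by majority vote
    preds = np.array([tree.predict(X) for tree in ensemble])  # shape (n_estimators, n_samples)
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, preds)

Unpruned trees are used deliberately here: deep individual trees have low bias but high variance, and averaging many of them is precisely how bagging reduces that variance.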

 
 
 
 

2.2.3 Bagging with scikit-learn
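
scikit-learn provides bagging out of the box through sklearn.ensemble.BaggingClassifier (and BaggingRegressor for regression). The snippet below is a minimal sketch; the breast-cancer data set is used purely as a stand-in example. Note that scikit-learn 1.2 and later pass the base learner as estimator, while older versions call the argument base_estimator.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)          # illustrative data set
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=13)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner (base_estimator before scikit-learn 1.2)
    n_estimators=500,                    # number of bootstrap samples, and thus of trees
    max_samples=1.0,                     # each bootstrap sample is as large as the training set
    bootstrap=True,                      # sample with replacement (this is what makes it bagging)
    oob_score=True,                      # estimate generalization error from out-of-bag examples
    random_state=13)
bag.fit(X_trn, y_trn)

print("test accuracy:", bag.score(X_tst, y_tst))
print("out-of-bag estimate:", bag.oob_score_)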

 
 
 

2.2.4 Faster training with parallelization
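
Because every tree is trained independently, the fits can be distributed across CPU cores. In scikit-learn this is controlled by the n_jobs argument; the timing comparison below is an illustrative sketch on synthetic data.

from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=13)  # illustrative data

for n_jobs in (1, -1):                  # 1 = a single core, -1 = all available cores
    bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=500, n_jobs=n_jobs, random_state=13)
    start = perf_counter()
    bag.fit(X, y)
    print(f"n_jobs={n_jobs:>2d}: fit in {perf_counter() - start:.1f}s")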

 
 
 

2.3 Random forests

 
 
 

2.3.1 Randomized decision trees
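
The extra ingredient in a random forest is split-time feature randomization: at every split, a tree considers only a small random subset of the features rather than all of them. One way to see this with scikit-learn is to pass a decision tree with max_features='sqrt' to BaggingClassifier; the result is, in essence, a hand-rolled random forest. The settings below are illustrative.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# A decision tree that, at every split, examines only a random subset of
# sqrt(n_features) candidate features instead of all of them
randomized_tree = DecisionTreeClassifier(max_features='sqrt')

# Bagging over such randomized trees is essentially a random forest
forest_by_hand = BaggingClassifier(
    estimator=randomized_tree,
    n_estimators=500,
    bootstrap=True,
    random_state=13)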

 

2.3.2 Random forests with scikit-learn
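
scikit-learn packages this combination directly as sklearn.ensemble.RandomForestClassifier (and RandomForestRegressor). A minimal sketch, again using the breast-cancer data set as a stand-in example:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)          # illustrative data set
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=13)

rf = RandomForestClassifier(
    n_estimators=500,        # number of trees
    max_features='sqrt',     # features considered at each split
    oob_score=True,          # out-of-bag error estimate, exactly as in bagging
    n_jobs=-1,               # fit the trees on all available cores
    random_state=13)
rf.fit(X_trn, y_trn)

print("test accuracy:", rf.score(X_tst, y_tst))
print("out-of-bag estimate:", rf.oob_score_)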

 

2.3.3 Feature importances
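
A fitted random forest exposes impurity-based feature importances through its feature_importances_ attribute, averaged over all the trees. A short, illustrative sketch that prints the top-ranked features:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()                       # illustrative data set
rf = RandomForestClassifier(n_estimators=500, random_state=13)
rf.fit(data.data, data.target)

# Impurity-based importances, averaged over all trees; they sum to 1.0
importances = rf.feature_importances_
for i in np.argsort(importances)[::-1][:5]:       # the five highest-ranked features
    print(f"{data.feature_names[i]:<25s} {importances[i]:.3f}")

Impurity-based importances are computed from the training data and can favor features with many distinct values; sklearn.inspection.permutation_importance is a commonly used alternative when that is a concern.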

 
 
 

2.4 More homogeneous parallel ensembles

 
 
 
 

2.4.1 Pasting
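
Pasting is the sampling-without-replacement counterpart of bagging. With BaggingClassifier this amounts to setting bootstrap=False and drawing subsets smaller than the full training set; the configuration below is an illustrative sketch.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Pasting: each tree sees a random subset of examples drawn *without* replacement
pasting = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.5,      # each subset contains 50% of the training examples
    bootstrap=False,      # without replacement, so pasting rather than bagging
    random_state=13)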

 

2.4.2 Random subspaces and random patches
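
Random subspaces resample the features instead of the examples, and random patches resample both at once. Both variants map onto BaggingClassifier arguments: max_features and bootstrap_features control the feature axis, max_samples and bootstrap control the example axis. The fractions below are illustrative choices.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random subspaces: every tree sees all training examples,
# but only a random subset of the features
subspaces = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=1.0, bootstrap=False,              # keep all training examples
    max_features=0.5, bootstrap_features=False,    # each tree gets a random half of the features
    random_state=13)

# Random patches: random subsets of examples *and* features for every tree
patches = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.75, bootstrap=True,              # resample the examples as well
    max_features=0.5, bootstrap_features=False,
    random_state=13)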

 
 

2.4.3 Extra Trees
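
Extra Trees (extremely randomized trees) push the randomization one step further: candidate split thresholds are drawn at random rather than optimized, and by default each tree is trained on the whole training set instead of a bootstrap sample. scikit-learn implements this as ExtraTreesClassifier; the sketch below is illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)          # illustrative data set
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=13)

# Like a random forest, but with randomly drawn split thresholds and,
# by default, no bootstrap sampling of the training examples
xt = ExtraTreesClassifier(
    n_estimators=500,
    max_features='sqrt',
    bootstrap=False,       # the scikit-learn default for Extra Trees
    n_jobs=-1,
    random_state=13)
xt.fit(X_trn, y_trn)
print("test accuracy:", xt.score(X_tst, y_tst))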

 
 