2 Homogeneous Parallel Ensembles: Bagging and Random Forests


This chapter covers

  • Training homogeneous parallel ensembles
  • Implementing and understanding how Bagging works
  • Implementing and understanding how Random Forest works
  • Training variants with pasting, random subspaces, random patches, and ExtraTrees
  • Using bagging and random forests in practice

In Chapter 1, we introduced ensemble learning and created our first rudimentary ensemble. To recap, an ensemble method relies on the notion of “wisdom of the crowd”: the combined answer of many diverse models is often better than any one individual answer.

We begin our journey into ensemble learning in earnest with parallel ensemble methods, because they are conceptually easy to understand and straightforward to implement.

Parallel ensemble methods, as the name suggests, train each component base estimator independently of the others, which means that they can be trained in parallel. As we will see, parallel ensemble methods can be further distinguished as homogeneous and heterogeneous parallel ensembles depending on the kind of learning algorithms they use.

In this chapter, we will learn about homogeneous parallel ensembles, whose component models are all trained using the same machine-learning algorithm. This is in contrast to heterogeneous parallel ensembles (covered in the next chapter), whose component models are trained using different machine-learning algorithms.
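To make this distinction concrete, here is a minimal sketch using scikit-learn (the library used throughout this chapter). A homogeneous ensemble such as bagging trains many copies of the same algorithm (decision trees, by default), while a heterogeneous ensemble such as voting combines different algorithms. The dataset and parameter choices below are illustrative, not from the chapter.

```python
# A minimal sketch contrasting homogeneous vs. heterogeneous parallel
# ensembles, assuming scikit-learn is installed. In both cases, the base
# estimators are trained independently of one another.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# A small synthetic classification problem for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Homogeneous: every base estimator is the same algorithm
# (BaggingClassifier uses decision trees by default)
homogeneous = BaggingClassifier(n_estimators=10, random_state=42)
homogeneous.fit(X, y)

# Heterogeneous: base estimators are different algorithms
heterogeneous = VotingClassifier(estimators=[
    ('tree', DecisionTreeClassifier(random_state=42)),
    ('logreg', LogisticRegression(max_iter=1000)),
    ('nb', GaussianNB()),
])
heterogeneous.fit(X, y)

print('homogeneous accuracy:  ', homogeneous.score(X, y))
print('heterogeneous accuracy:', heterogeneous.score(X, y))
```

Because each base estimator is trained independently, both kinds of ensemble can exploit multiple cores (for example, via scikit-learn's `n_jobs` parameter), a point we return to in section 2.2.4.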

2.1   Parallel Ensembles

2.2   Bagging: Bootstrap Aggregating

2.2.1   Intuition: Resampling and Model Aggregation

2.2.2   Implementing Bagging

2.2.3   Bagging with scikit-learn

2.2.4   Faster Training with Parallelization

2.3   Random Forests

2.3.1   Randomized Decision Trees

2.3.2   Random Forests with scikit-learn

2.3.3   Feature Importances

2.4   More Homogeneous Parallel Ensembles

2.4.1   Pasting

2.4.2   Random Subspaces and Random Patches

2.4.3   ExtraTrees

2.5   Case Study: Breast Cancer Diagnosis

2.5.1   Loading and Pre-processing

2.5.2   Bagging, Random Forests, and ExtraTrees