1 Ensemble methods: Hype or hallelujah?


This chapter covers

  • Defining and framing the ensemble learning problem
  • Motivating the need for ensembles in different applications
  • Understanding how ensembles handle fit versus complexity
  • Implementing our first ensemble with ensemble diversity and model aggregation

In October 2006, Netflix announced a $1 million prize for the team that could improve movie recommendations by 10% over Netflix’s own proprietary recommendation system, CineMatch. The Netflix Grand Prize was one of the first-ever open data science competitions and attracted tens of thousands of teams.

The training set consisted of 100 million ratings that 480,000 users had given to 17,000 movies. Within three weeks, 40 teams had already beaten CineMatch’s results. By September 2007, more than 40,000 teams had entered the contest, and a team from AT&T Labs took the 2007 Progress Prize by improving upon CineMatch by 8.42%.

As the competition progressed with the 10% mark remaining elusive, a curious phenomenon emerged among the competitors. Teams began to collaborate and share knowledge about effective feature engineering, algorithms, and techniques. Inevitably, they began combining their models, blending individual approaches into powerful and sophisticated ensembles of many models. These ensembles combined the best of many diverse models and features, and they proved to be far more effective than any individual model.
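To make the idea of blending concrete, here is a minimal sketch in Python. It is not the approach any Netflix Prize team used. It assumes synthetic ratings, three hypothetical models whose errors are independent of one another, simple averaging as the blending rule, and root mean squared error (RMSE) as the error measure. The helper names `rmse` and `blend` are introduced purely for illustration.

```python
import math
import random


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def blend(model_predictions):
    """Average the predictions of several models, example by example."""
    return [sum(group) / len(group) for group in zip(*model_predictions)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "true" ratings on a 1-5 scale (illustrative data, not the Netflix set).
true_ratings = [rng.uniform(1.0, 5.0) for _ in range(2000)]

# Three hypothetical models: each predicts the true rating plus its own
# independent Gaussian noise, which stands in for "diverse" errors.
model_predictions = [
    [r + rng.gauss(0.0, 1.0) for r in true_ratings]
    for _ in range(3)
]

individual_rmse = [rmse(p, true_ratings) for p in model_predictions]
blended_rmse = rmse(blend(model_predictions), true_ratings)

print("individual RMSE:", [round(e, 3) for e in individual_rmse])
print("blended RMSE:   ", round(blended_rmse, 3))
```

Because the three sets of errors are independent in this sketch, they partly cancel when averaged, so the blended error comes out lower than that of any single model. If the models made the same mistakes, averaging would help far less, which is why diversity among the combined models matters.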

1.1 Ensemble methods: The wisdom of the crowds

1.2 Why you should care about ensemble learning

1.3 Fit vs. complexity in individual models

1.3.1 Regression with decision trees

1.3.2 Regression with support vector machines

1.4 Our first ensemble

1.5 Terminology and taxonomy for ensemble methods