chapter one

1 Ensemble Methods: Hype or Hallelujah?


This chapter covers

  • Defining and framing the ensemble learning problem
  • Motivating the need for ensembles in different applications
  • Understanding how ensembles handle fit vs. complexity
  • Implementing our first ensemble using model diversity and aggregation

In October 2006, Netflix announced a $1 million prize for the first team to improve movie recommendations by 10% over Cinematch, its own proprietary recommendation system. The Netflix Grand Prize was one of the first open data science competitions and attracted tens of thousands of teams.

The training set consisted of 100 million ratings that 480 thousand users had given to 17 thousand movies. Within three weeks, 40 teams had already beaten Cinematch's results. By September 2007, over 40 thousand teams had entered the contest, and a team from AT&T Labs took the 2007 Progress Prize by improving on Cinematch by 8.42%.

As the competition progressed and the 10% mark remained elusive, a curious phenomenon emerged among the competitors. Teams began to collaborate and share knowledge about effective feature engineering, algorithms, and techniques. Inevitably, they also began combining their models, blending individual approaches into sophisticated ensembles of many models. These ensembles, which combined the strengths of many diverse models and feature sets, proved far more effective than any individual model.

1.1 Ensemble Methods: The Wisdom of the Crowds

1.2 Why You Should Care About Ensemble Learning

1.3 Fit vs. Complexity in Individual Models

1.3.1 Regression with Decision Trees

1.3.2 Regression with Support Vector Machines

1.4 Our First Ensemble

1.5 Summary