Chapter 6. Example: NYC taxi data


This chapter covers

  • Introducing, visualizing, and preparing a real-world dataset about NYC taxi trips
  • Building a classification model to predict passenger tipping habits
  • Optimizing an ML model by tuning model parameters and engineering features
  • Building and optimizing a regression model to predict tip amount
  • Using models to gain a deeper understanding of data and the behavior it describes

In the previous five chapters, you learned how to go from raw, messy data to building, validating, and optimizing models by tuning parameters and engineering features that capture the domain knowledge of the problem. Although we’ve used a variety of small examples throughout these chapters to illustrate individual points, it’s time to apply what you’ve learned to a full, real-world example. This is the first of three chapters (along with chapters 8 and 10) dedicated entirely to such an example.

In the first section of this chapter, you’ll take a closer look at the data and at several visualizations that help you understand what the data can offer. We explain how the initial data preparation is performed, so that the data is ready for the modeling experiments in the subsequent sections. In the second section, you’ll set up a classification problem and improve the performance of the model by tuning model parameters and engineering new features.

6.1. Data: NYC taxi trip and fare information

6.2. Modeling

6.3. Summary

6.4. Terms from this chapter



open data: Data made publicly available by institutions and organizations.
FOIL: Freedom of Information Law. (The federal version is known as the Freedom of Information Act, or FOIA.)
too-good-to-be-true scenario: A model that is far more accurate than you expected. Chances are that some features in the model, or some peculiarities in the data, are letting the model “cheat.”
premature assumptions: Assumptions made about the data without validating them, which risks biasing your interpretation of the results.
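
The too-good-to-be-true scenario can be checked for with a quick sanity test before any real modeling. The following is a minimal sketch, not the chapter's own method: it flags any single feature that predicts the target almost perfectly on its own with a simple threshold rule. The column names (trip_distance, tip_amount, tipped), the 0.99 cutoff, the helper names, and the synthetic rows are all assumptions made up for illustration.

```python
import random


def best_threshold_accuracy(values, labels):
    """Accuracy of the best rule 'value > t' (or its inverse) over all cut points t."""
    n = len(labels)
    positives = sum(labels)
    # Start from the majority-class baseline (always predict the same class).
    best = max(positives, n - positives) / n
    for t in sorted(set(values)):
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        # The inverted rule 'value <= t' gets exactly the complement right.
        best = max(best, correct / n, (n - correct) / n)
    return best


def suspicious_features(rows, target, threshold=0.99):
    """Names of features that alone reach at least `threshold` accuracy."""
    labels = [row[target] for row in rows]
    flagged = []
    for name in rows[0]:
        if name == target:
            continue
        accuracy = best_threshold_accuracy([row[name] for row in rows], labels)
        if accuracy >= threshold:
            flagged.append(name)
    return flagged


# Synthetic, made-up data: tip_amount is nonzero exactly when the passenger
# tipped, so it leaks the answer; trip_distance is unrelated noise.
random.seed(0)
rows = []
for _ in range(200):
    tipped = random.random() < 0.5
    rows.append({
        "trip_distance": random.uniform(0.5, 20.0),
        "tip_amount": random.uniform(1.0, 10.0) if tipped else 0.0,
        "tipped": int(tipped),
    })

print(suspicious_features(rows, "tipped"))
```

In this synthetic setup, only tip_amount should be flagged, because a tip amount above zero gives away the label. A flagged feature is not proof of a problem, only a prompt to ask whether that value would really be known at prediction time.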