Chapter 6. Example: NYC taxi data

This chapter covers

Introducing, visualizing, and preparing a real-world dataset about NYC taxi trips
Building a classification model to predict passenger tipping habits
Optimizing an ML model by tuning model parameters and engineering features
Building and optimizing a regression model to predict tip amount
Using models to gain a deeper understanding of data and the behavior it describes

In the previous five chapters, you learned how to go from raw, messy data to building, validating, and optimizing models by tuning parameters and engineering features that capture the domain knowledge of the problem. Although we’ve used a variety of minor examples throughout these chapters to illustrate the points of the individual sections, it’s time for you to use the knowledge you’ve acquired and work through a full, real-world example. This is the first of three chapters (along with chapters 8 and 10) entirely dedicated to a full, real-world example.

In the first section of this chapter, you’ll take a closer look at the data and various useful visualizations that help you gain a better understanding of the possibilities of the data. We explain how the initial data preparation is performed, so the data will be ready for the modeling experiments in the subsequent sections. In the second section, you’ll set up a classification problem and improve the performance of the model by tuning model parameters and engineering new features.

6.1. Data: NYC taxi trip and fare information

6.2. Modeling

6.3. Summary

6.4. Terms from this chapter

Word	Definition
open data	Data made available publicly by institutions and organizations.
FOIL	Freedom of Information Law. (The federal version is known as the Freedom of Information Act, or FOIA.)
too-good-to-be-true scenario	If a model is extremely accurate compared to what you would have thought, chances are that some features in the model, or some data peculiarities, are causing the model to “cheat.”
premature assumptions	Assuming something about the data without validation, risking biasing your views of the results.

Chapter 6. Example: NYC taxi data

This chapter covers

6.1. Data: NYC taxi trip and fare information

6.2. Modeling

6.3. Summary

6.4. Terms from this chapter

Unable to load book!