Chapter 6. Example: NYC taxi data
This chapter covers
- Introducing, visualizing, and preparing a real-world dataset about NYC taxi trips
- Building a classification model to predict passenger tipping habits
- Optimizing an ML model by tuning model parameters and engineering features
- Building and optimizing a regression model to predict tip amount
- Using models to gain a deeper understanding of data and the behavior it describes
In the previous five chapters, you learned how to go from raw, messy data to building, validating, and optimizing models by tuning parameters and engineering features that capture the domain knowledge of the problem. Although we’ve used a variety of minor examples throughout these chapters to illustrate the points of the individual sections, it’s time for you to use the knowledge you’ve acquired and work through a full, real-world example. This is the first of three chapters (along with chapters 8 and 10) entirely dedicated to a full, real-world example.
In the first section of this chapter, you’ll take a closer look at the data and various useful visualizations that help you gain a better understanding of the possibilities of the data. We explain how the initial data preparation is performed, so the data will be ready for the modeling experiments in the subsequent sections. In the second section, you’ll set up a classification problem and improve the performance of the model by tuning model parameters and engineering new features.
With companies and organizations producing more and more data, a large set of rich and interesting datasets has become available in recent years. In addition, some of these organizations are embracing the concept of open data, enabling the public dissemination and use of the data by any interested party.
Recently, an extremely detailed dataset of New York City taxi trip records, covering every taxi trip of 2013, was made available through a request under the New York State Freedom of Information Law (FOIL).[1] The dataset contains detailed information about each individual taxi trip, including the pickup and drop-off locations, the time and duration of the trip, the distance traveled, and the fare amount. You'll see that this data qualifies as real-world data, not only because of the way it was generated but also because it's messy: there are missing data, spurious records, unimportant columns, baked-in biases, and so on.
1Initially released in a blog post by Chris Wong: http://chriswhong.com/open-data/foil_nyc_taxi/.
And speaking of data, there’s a lot of it! The full dataset is over 19 GB of CSV data, making it too large for many machine-learning implementations to handle on most systems. For simplicity, in this chapter you’ll work with a smaller subset of the data. In chapters 9 and 10, you’ll investigate methods that are able to scale to sizes like this and even larger, so by the end of the book you’ll know how to analyze all 19 GB of data.
The data is available for download at www.andresmh.com/nyctaxitrips/. The dataset consists of 12 pairs of trip/fare compressed CSV files. Each file contains about 14 million records, and the trip/fare files are matched line by line.
You’ll follow our basic ML workflow: analyzing the data; extracting features; building, evaluating, and optimizing models; and predicting on new data. In the next subsection, you’ll look at the data by using some of the visualization methods from chapter 2.
As you get started with a new problem, the first step is to gain an understanding of what the dataset contains. We recommend that you start by loading the dataset and viewing it in tabular form. For this chapter, we’ve joined the trip/fare lines into a single dataset. Figure 6.1 shows the first six rows of data.
Figure 6.1. The first six rows of the NYC taxi trip and fare record data. Most of the columns are self-explanatory, but we introduce some of them in more detail in the text that follows.

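As a minimal sketch of how this join can be performed with pandas (the file names, the list of duplicated ID columns, and the subsample fraction are assumptions based on the downloaded archives):

```python
import pandas as pd

# Load one month of trip and fare records.
trips = pd.read_csv("trip_data_1.csv", skipinitialspace=True)
fares = pd.read_csv("trip_fare_1.csv", skipinitialspace=True)

# The trip/fare files are matched line by line, so a column-wise
# concatenation joins them; drop the ID columns present in both files.
dup_cols = ["medallion", "hack_license", "vendor_id", "pickup_datetime"]
data = pd.concat([trips, fares.drop(columns=dup_cols)], axis=1)

# Work with a small random subsample so everything fits in memory.
data = data.sample(frac=0.01, random_state=42)
print(data.head(6))
```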
The medallion and hack_license columns look like simple ID columns that are useful for bookkeeping but less interesting from an ML perspective. From their column names, a few of the columns look like categorical data, like vendor_id, rate_code, store_and_fwd_flag, and payment_type. For individual categorical variables, we recommend visualizing their distributions either in tabular form or as bar plots. Figure 6.2 uses bar plots to show the distribution of values in each of these categorical columns.
Figure 6.2. The distribution of values across some of the categorical-looking columns in our dataset

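To reproduce plots like these, a sketch along the following lines should work, counting category occurrences with pandas (continuing from the data DataFrame loaded earlier):

```python
import matplotlib.pyplot as plt

cat_cols = ["vendor_id", "rate_code", "store_and_fwd_flag", "payment_type"]

# One bar plot per categorical-looking column.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for col, ax in zip(cat_cols, axes.ravel()):
    data[col].value_counts().plot(kind="bar", ax=ax, title=col)
plt.tight_layout()
plt.show()
```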
Next, let’s look at some of the numerical columns in the dataset. It’s interesting to validate, for example, that correlations exist between things like trip duration (trip_time_in_secs), distance, and total cost of a trip. Figure 6.3 shows scatter plots of some of these factors plotted against each other.
Figure 6.3. Scatter plots of taxi trips for the time in seconds versus the trip distance, and the time in seconds versus the trip amount (USD), respectively. A certain amount of correlation exists, as expected, but the scatter is still relatively high. Some less-logical clusters also appear, such as a lot of zero-time trips, even expensive ones, which may indicate corrupted data entries.

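A sketch of these scatter plots, assuming the total_amount column holds the total trip cost shown in figure 6.1:

```python
import matplotlib.pyplot as plt

# Trip time versus distance, and trip time versus total amount paid.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
data.plot.scatter(x="trip_time_in_secs", y="trip_distance",
                  s=2, alpha=0.3, ax=ax1)
data.plot.scatter(x="trip_time_in_secs", y="total_amount",
                  s=2, alpha=0.3, ax=ax2)
plt.show()
```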
Finally, in figure 6.4, you can visualize the pickup locations in the latitude/longitude space, defining a map of NYC taxi trips. The distribution looks reasonable, with most pickup locations occurring in downtown Manhattan, many occurring in the other boroughs, and surprisingly a few happening in the middle of the East River!
Figure 6.4. The latitude/longitude of pickup locations. Note that the x-axis is flipped, compared to a regular map. You can see a huge number of pickups in Manhattan, falling off as you move away from the city center.

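A sketch of this plot follows; the bounding-box coordinates are rough assumptions used to drop spurious (0, 0) locations and other outliers:

```python
import matplotlib.pyplot as plt

# Keep pickups inside a rough NYC bounding box before plotting.
nyc = data[data["pickup_latitude"].between(40.5, 41.0)
           & data["pickup_longitude"].between(-74.3, -73.6)]
nyc.plot.scatter(x="pickup_longitude", y="pickup_latitude",
                 s=1, alpha=0.1, figsize=(8, 8))
plt.show()
```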
With a fresh perspective on the data you’re dealing with, let’s go ahead and dream up a realistic problem that you can solve with this dataset by using machine learning.
When we first looked at this data, a particular column immediately grabbed our attention: tip_amount. This column stores the information about the amount of the tip (in US dollars) given for each ride. It would be interesting to understand, in greater detail, what factors most influence the amount of the tip for any given NYC taxi trip.
To this end, you might want to build a classifier that uses all of the trip information to try to predict whether a passenger will tip a driver. With such a model, you could predict tip versus no tip at the end of each trip. A taxi driver could have this model installed on a mobile device and would get no-tip alerts and be able to alter the situation before it was too late. While you wait for approval for having your app installed in all NYC taxis, you can use the model to give you insight into which parameters are most important, or predictive, of tip versus no tip in order to attempt to boost overall tipping on a macro level. Figure 6.5 shows a histogram of the tip amount across all taxi trips.
Figure 6.5. The distribution of tip amount. Around half the trips yielded $0 tips, which is more than we’d expect intuitively.

So the plan for our model is to predict which trips will result in no tip, and which will result in a tip. This is a job for a binary classifier. With such a classifier, you'll be able to do the following:
- Assist the taxi driver by providing an alert to predicted no-tip situations
- Gain understanding of how and why such a situation might arise by using the dataset to uncover the driving factors (pun intended!) behind incidence of tipping in NYC taxi rides
Before you start building this model, we'll tell you the real story of how our first attempt at tackling this problem was quite unsuccessful while disguised as very successful (the worst kind of unsuccessful), and how we fixed it. This type of detour is extremely common when working with real data, so it's helpful to include the lessons learned here. When working with machine learning, it's critical to watch out for two pitfalls: too-good-to-be-true scenarios and premature assumptions that aren't rooted in the data.
As a general rule in ML, if the cross-validated accuracy of a model is higher than you'd have expected, chances are the model is cheating somewhere; the real world is creative in making your life as a data scientist difficult. When building our initial tip/no-tip classification models, we quickly obtained a very high cross-validated predictive accuracy. We were so excited about the model's performance on this newly acquired dataset (we nailed it!) that we temporarily ignored the warnings of a cheating model. But because we'd been bitten by such things many times before, the overly optimistic results eventually prompted us to investigate further.
One of the things we looked at was the importance of the input features (as you’ll see in more detail in later sections). In our case, a certain feature totally dominated in terms of feature importance in the model: payment type.
From our own taxi experience, this could make sense. People paying with credit cards (in the pre-Square era) may have a lower probability of tipping, whereas if you pay with cash, you almost always round up to whatever you have the bills for. So we compared the number of tips versus no tips for passengers paying by credit card and by cash. Alas, it turned out that the vast majority (more than 95%) of the millions of passengers paying with a credit card did tip. So much for that theory.
So how many people paying with cash tipped? All of them?
In actuality, none of the passengers paying with cash had tipped! Then it quickly became obvious. Whenever a passenger paid with cash and gave a tip, the driver didn’t register it in whatever way was necessary for it to be included as part of our data. By going through our ML sanity checks, we unearthed millions of instances of potential fraud in the NYC taxi system!
Returning to the implications for our ML model: in a situation like this, when there’s a problem in the generation of the data, there’s simply no way to trust that part of the data for building an ML model. If the answers are incorrect in nefarious ways, then what the ML model learns may be completely incorrect and detached from reality.
Ultimately, to sidestep the problem, we opted to remove from the dataset all trips paid for with cash. This modified the objective: to predict the incidence of tipping for only noncash payers. It always feels wrong to throw away data, but in this case we decided that under the new data-supported assumption that all cash-payment data was untrustworthy, the best option was to use the noncash data to answer a slightly different problem. Of course, there’s no guarantee that other tip records aren’t wrong as well, but we can at least check the new distribution of tip amounts. Figure 6.6 shows the histogram of tip amounts after filtering out any cash-paid trips.
Figure 6.6. The distribution of tip amounts when omitting cash payments (after discovering that cash tips are never recorded in the system)

With the bad data removed, the distribution is looking much better: only about 5% of trips result in no tip. Our job in the next section is to find out why.
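Here's a minimal sketch of this preparation step: removing cash-paid trips and defining the binary tip/no-tip target used in the rest of the chapter. The "CSH" payment code is an assumption about how cash is encoded in the payment_type column:

```python
# Drop cash-paid trips, whose tips are never recorded.
data = data[data["payment_type"] != "CSH"]

# Binary target: 1 if the passenger tipped, 0 otherwise.
data["tipped"] = (data["tip_amount"] > 0).astype(int)
```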
With the data prepared for modeling, you can easily use your knowledge from chapter 3 to set up and evaluate models. In the following subsections, you’ll build different versions of models, trying to improve the performance with each iteration.
You'll start this modeling endeavor as simply as possible. You'll work with a simple logistic regression algorithm. You'll also restrict yourself initially to the numerical values in the dataset, because those are handled naturally by the logistic regression algorithm, without any data preprocessing.
You’ll use the scikit-learn and pandas libraries in Python to develop the model. Before building the models, we shuffled the instances randomly and split them into 80% training and 20% holdout testing sets. You also need to scale the data so no column is considered more important than others a priori. If the data has been loaded into a pandas DataFrame, the code to build and validate this model looks something like the following listing.
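A sketch of such a listing follows; the exact set of numerical columns is an assumption based on figure 6.1:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import StandardScaler

# Numerical feature columns (an assumed subset of figure 6.1).
num_cols = ["passenger_count", "trip_time_in_secs", "trip_distance",
            "pickup_longitude", "pickup_latitude",
            "dropoff_longitude", "dropoff_latitude",
            "fare_amount", "surcharge", "mta_tax",
            "tolls_amount", "total_amount"]
data = data.dropna(subset=num_cols)

# Shuffle the instances and split 80% training / 20% holdout.
data = data.sample(frac=1.0, random_state=42)
split = int(0.8 * len(data))
train, test = data.iloc[:split], data.iloc[split:]
y_train, y_test = train["tipped"], test["tipped"]

# Scale the features so no column dominates a priori.
scaler = StandardScaler()
X_train = scaler.fit_transform(train[num_cols])
X_test = scaler.transform(test[num_cols])

model = LogisticRegression()
model.fit(X_train, y_train)

# Plot the holdout ROC curve.
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label="AUC = %.2f" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc="lower right")
plt.show()
```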
The last part of listing 6.1 plots the ROC curve for this first, simple classifier. The holdout ROC curve is shown in figure 6.7.
Figure 6.7. The receiver operating characteristic (ROC) curve of the logistic regression tip/no-tip classifier. With an area under the curve (AUC) of 0.5, the model seems to perform no better than random guessing. Not a good sign for our model.

There’s no way around it: the performance of this classifier isn’t good! With a holdout AUC of 0.51, the model is no better than random guessing (flipping a coin weighted 95% “tip” and 5% “no tip” to predict each trip), which is, for obvious reasons, not useful. Luckily, we started out simply and have a few ways of trying to improve the performance of this model.
The first thing you’ll try is to switch to a different algorithm—one that’s nonlinear. Considering how poor the first attempt was, it seems that a linear model won’t cut it for this dataset; simply put, tipping is a complicated process! Instead, you’ll use a nonlinear algorithm called random forest, well known for its high level of accuracy on real-world datasets. You could choose any of a number of other algorithms (see the appendix), but we’ll leave it as an exercise for you to evaluate and compare different algorithms. Here’s the code (relative to the previous model) for building this model.
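A sketch of the change, continuing from the previous listing (tree ensembles don't require feature scaling, so the raw columns are used directly; the forest size is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve

# Swap the linear model for a nonlinear ensemble of decision trees.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(train[num_cols], y_train)

# Holdout ROC/AUC, as before.
probs = model.predict_proba(test[num_cols])[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
print("Holdout AUC: %.3f" % auc(fpr, tpr))

# Rank the input features by their importance in the fitted forest.
for score, name in sorted(zip(model.feature_importances_, num_cols),
                          reverse=True):
    print("%-20s %.3f" % (name, score))
```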
The results of running the code in listing 6.2 are shown in figure 6.8. You can see a significant increase in holdout accuracy—the holdout AUC is now 0.64—showing clearly that there’s a predictive signal in the dataset. Some combinations of the input features are capable of predicting whether a taxi trip will yield any tips from the passenger. If you’re lucky, further feature engineering and optimization will be able to boost the accuracy levels even higher.
Figure 6.8. The ROC curve of the nonlinear random forest model. The AUC is significantly better: at 0.64, it’s likely that there’s a real signal in the dataset.

You can also use the model to gain insight into what features are most important in this moderately predictive model. This exercise is a crucial step for a couple of reasons:
- It enables you to identify any cheating features (for example, the problem with noncash payers) and to use that as insight to rectify any issues.
- It serves as a launching point for further feature engineering. If, for instance, you identify latitude and longitude as the most important features, you can consider deriving other features from those metrics, such as distance from Times Square. Likewise, if there’s a feature that you thought would be important but it doesn’t appear on the top feature list, then you’ll want to analyze, visualize, and potentially clean up or transform that feature.
Figure 6.9 (also generated by the code in listing 6.2) shows the list of features and their relative importance for the random forest model. From this figure, you can see that the location features are the most important, along with time, trip distance, and fare amount. It may be that riders in some parts of the city are less patient with slow, expensive rides, for example. You’ll look more closely at the potential insights gained in section 6.2.5.
Figure 6.9. The important features of the random forest model. The drop-off and pickup location features seem to dominate the model.

Now that you've chosen the algorithm, let's make sure you're using all of the raw features, including the categorical columns, not just the plain numerical ones.
Without going deeper into the realm of feature engineering, you can perform some simple data preprocessing to increase the accuracy.
In chapter 2, you learned how to work with categorical features. Some ML algorithms work with categorical features directly, but you’ll use the common trick of “Booleanizing” the categorical features: creating a column of value 0 or 1 for each of the possible categories in the feature. This makes it possible for any ML algorithm to handle categorical data without changes to the algorithm itself.
The code for converting all of the categorical features is shown in the following listing.
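A sketch using the pandas get_dummies helper, which creates one 0/1 column per category (continuing from the earlier sketches):

```python
import pandas as pd

# "Booleanize" each categorical feature into one 0/1 column per
# category, and add the new columns to the feature list.
cat_cols = ["vendor_id", "rate_code", "store_and_fwd_flag", "payment_type"]
dummies = pd.get_dummies(data[cat_cols].astype(str))
data = pd.concat([data.drop(columns=cat_cols), dummies], axis=1)
num_cols = num_cols + list(dummies.columns)
```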
After creating the Booleanized columns, you run the data through listing 6.2 again and obtain the ROC curve and feature importance list shown in figure 6.10. Note that your holdout AUC has risen slightly, from 0.64 to 0.656.
Figure 6.10. The ROC curve and feature importance list of the random forest model with all categorical variables converted to Boolean (0/1) columns, one per category per feature. The new features are bringing new useful information to the table, because the AUC is seen to increase from the previous model without categorical features.

As model performance increases, you can consider additional factors. You haven't done any real feature engineering yet, of course; the data transformations applied so far count as basic data preprocessing.
At this point, it's time to start working with the data to produce new features: what you've previously come to know as feature engineering. In chapter 5, we introduced a set of date-time features that transform dates and timestamps into numerical columns. You can easily imagine the time of day or the day of the week having some influence on how a passenger tips.
The code for calculating these features is presented in the following listing.
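A sketch of these features; the particular set of date-time components is an assumption:

```python
import pandas as pd

# Derive numerical date-time features from the pickup timestamp.
pickup = pd.to_datetime(data["pickup_datetime"])
data["pickup_hour"] = pickup.dt.hour
data["pickup_weekday"] = pickup.dt.dayofweek  # 0 = Monday
data["pickup_day"] = pickup.dt.day
data["pickup_month"] = pickup.dt.month
num_cols = num_cols + ["pickup_hour", "pickup_weekday",
                       "pickup_day", "pickup_month"]
```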
With these date-time features, you can build a new model. You run the data through the code in listing 6.2 once again and obtain the ROC curve and feature importance shown in figure 6.11.
Figure 6.11. The ROC curve and feature importance list for the random forest model, including all categorical features and additional date-time features

You can see an evolution in the accuracy of the model with additional data preprocessing and feature engineering. At this point, you’re able to predict whether a passenger will tip the driver with an accuracy significantly above random. Up to now, you’ve looked only at improving the data in order to improve the model, but you can try to improve this model in two other ways:
- Vary the model parameters to see whether the default values are really optimal (a grid-search sketch follows this list)
- Increase the dataset size
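As a sketch of the first option, scikit-learn's GridSearchCV can search over random forest parameters with cross-validation; the grid values here are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validated search over a small parameter grid, scored by AUC.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(train[num_cols], y_train)
print(search.best_params_, search.best_score_)
```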
In this chapter, we've been heavily subsampling the dataset so that the algorithms can handle it, even on a machine with 16 GB of memory. We'll talk more about the scalability of methods in chapters 9 and 10, but in the meantime we'll leave it to you to work with this data to increase the cross-validated accuracy even further!
It’s interesting to gain insight about the data through the act of building a model to predict a certain answer. From the feature importance list, you can understand which parameters have the most predictive power, and you use that to look at the data in new ways. In our initial unsuccessful attempt, it was because of inspection of the feature importance list that we discovered the problem with the data. In the current working model, you can also use the list to inspire some new visualizations.
At every iteration of our model in this section, the most important features have been the pickup and drop-off location features. Figure 6.12 plots the geographical distribution of drop-offs that yield tips from the passenger, as well as drop-offs from trips that don’t.
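A sketch of how such a figure can be produced, splitting drop-off locations on the tip/no-tip label:

```python
import matplotlib.pyplot as plt

# Drop-off locations for tipped versus untipped trips, side by side.
fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)
for ax, label, title in [(axes[0], 1, "Tip"), (axes[1], 0, "No tip")]:
    subset = data[data["tipped"] == label]
    ax.scatter(subset["dropoff_longitude"], subset["dropoff_latitude"],
               s=1, alpha=0.1)
    ax.set_title(title)
plt.show()
```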
Figure 6.12 shows an interesting trend of not tipping when being dropped off closer to the center of the city. Why is that? One possibility is that heavy traffic produces many slow trips, and the passenger isn't necessarily happy with the driver's behavior. As a non-US citizen, I have another theory. This particular area of the city has a high volume of both financial workers and tourists, and we'd expect the financial group to be concentrated farther south in Manhattan. To my mind, tourists are the most likely cause of this discrepancy: many countries have vastly different rules for tipping than the United States. In some Asian countries, people almost never tip, and in many northern European countries they tip much less, and rarely in taxis.
You can make many other interesting investigations based on this dataset. The point is, of course, that real-world data can often be used to say something interesting about the real world and the people generating the data.
This chapter introduced a dataset from the real world and defined a problem suitable for the machine-learning knowledge that you’ve built up over the previous five chapters. You went through the entire ML workflow, including initial data preparation, feature engineering, and multiple iterations of model building, evaluation, optimization, and prediction. The main takeaways from the chapter are these:
- With more and more organizations producing vast amounts of data, rich and interesting datasets are increasingly becoming available, within organizations if not publicly.
- Records of all taxi trips from NYC in 2013 have been released publicly. A lot of taxi trips occur in NYC in one year!
- Real-world data can be messy. Visualization and knowledge about the domain helps. Don’t get caught in too-good-to-be-true scenarios and don’t make premature assumptions about the data.
- Start iterating from the simplest possible model. Don’t spend time on premature optimization. Gradually increase complexity.
- Make choices and move on; for example, choose an algorithm early on. In an ideal world, you’d try all combinations at all steps in the iterative process of building a model, but you’d have to fix some things in order to make progress.
- Gain insights into the model and the data in order to learn about the domain and potentially improve the model further.
6.4. Terms from this chapter
Term | Definition
---|---
open data | Data made available publicly by institutions and organizations.
FOIL | Freedom of Information Law. (The federal version is known as the Freedom of Information Act, or FOIA.)
too-good-to-be-true scenario | If a model is extremely accurate compared to what you would have expected, chances are that some features in the model, or some data peculiarities, are causing the model to “cheat.”
premature assumptions | Assuming something about the data without validation, risking biasing your views of the results.