4 Preparing the data, part 2: Transforming the data


This chapter covers

  • Dealing with more incorrect values
  • Mapping complex, multiword values to single tokens
  • Fixing type mismatches
  • Dealing with rows that still contain bad values after cleanup
  • Creating new columns derived from existing columns
  • Preparing categorical and text columns to train a deep learning model
  • Reviewing the end-to-end solution introduced in chapter 2

In chapter 3, we corrected a set of errors and anomalies in the input dataset. More cleanup and preparation remain, and that's the focus of this chapter. We'll resolve the remaining issues (including multiword tokens and type mismatches) and examine your options for handling the bad values that persist after all the cleanup. Then we'll create derived columns and prepare non-numeric data to train a deep learning model. Finally, we'll take a closer look at the end-to-end solution introduced in chapter 2 to see how the data preparation steps we have completed so far fit into our overall journey to a deployed, trained deep learning model for predicting streetcar delays.

You will see a consistent theme throughout this chapter: updating the dataset so that it more closely matches the real-world situation of streetcar delays. By eliminating errors and ambiguities that separate the dataset from the real world, we increase our chances of training an accurate deep learning model.
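As a preview of the kinds of transformations the following sections walk through, here is a minimal pandas sketch. The column names and values (`Route`, `Location`, `Min Delay`, `Report Date`) are illustrative assumptions for this example, not the exact columns of the streetcar delay dataset.

```python
import pandas as pd

# Toy data with the kinds of problems this chapter addresses: inconsistent
# multiword locations, a non-numeric delay value, and raw date strings.
df = pd.DataFrame({
    "Route": ["501", "501", "999", "504"],
    "Location": ["king and bathurst", "King & Bathurst",
                 "Queen / Broadview", "roncesvalles yard"],
    "Min Delay": ["10", "5", "bad", "20"],
    "Report Date": ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"],
})

# Map inconsistent multiword location strings to a single canonical form.
df["Location"] = (df["Location"].str.lower()
                  .str.replace(" & ", " and ")
                  .str.replace(" / ", " and "))

# Fix a type mismatch: coerce delay minutes to numeric; unparseable
# values become NaN rather than raising an error.
df["Min Delay"] = pd.to_numeric(df["Min Delay"], errors="coerce")

# Replace remaining bad values with a single substitute (here, the median).
df["Min Delay"] = df["Min Delay"].fillna(df["Min Delay"].median())

# Create a derived column from an existing one: day of week from the date.
df["Report Date"] = pd.to_datetime(df["Report Date"])
df["Day of Week"] = df["Report Date"].dt.dayofweek

# Prepare a categorical column for a deep learning model by encoding
# its values as integer codes.
df["Route Code"] = df["Route"].astype("category").cat.codes
```

Each of these one-liners stands in for a step examined in detail in sections 4.2 through 4.10; the chapter's actual code operates on the full dataset and handles many more edge cases.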

4.1 Code for preparing and transforming the data

4.2 Dealing with incorrect values: Routes

4.3 Why only one substitute for all bad values?

4.4 Dealing with incorrect values: Vehicles

4.5 Dealing with inconsistent values: Location

4.6 Going the distance: Locations

4.7 Fixing type mismatches

4.8 Dealing with rows that still contain bad data

4.9 Creating derived columns

4.10 Preparing non-numeric data to train a deep learning model