In the previous chapter, you started with a cleaned-up version of the DC taxi data set and applied a data-driven sampling procedure to identify the right fraction of the data set to allocate to a held-out test subset. You also analyzed the results of the sampling experiments and then launched a PySpark job to generate three separate subsets of data: training, validation, and test.
This chapter takes you on a temporary detour from the DC taxi data set to prepare you to write scalable machine learning code using PyTorch. Don’t worry; chapter 7 returns to the DC taxi data set to benchmark a baseline PyTorch machine learning model. In this chapter, you will focus on learning PyTorch, one of the top frameworks for deep learning and many other types of machine learning algorithms. I have used TensorFlow 2.0, Keras, and PyTorch for machine learning projects that required distributed training on a machine learning platform, and I found PyTorch to be the best of the three. PyTorch scales from mission-critical, production machine learning use cases at Tesla¹ to state-of-the-art research at OpenAI.²