4 More exploratory data analysis and data preparation
This chapter covers

  • Analyzing summary statistics of the DC taxi data set
  • Evaluating alternative data set sizes for machine learning
  • Using statistical measures to choose the right machine learning data set size
  • Implementing data set sampling in a PySpark job

In the last chapter, you began analyzing the DC taxi fare data set. After the data set was converted to an analysis-friendly Apache Parquet format, you crawled the data schema and used the Athena interactive querying service to explore the data. These first steps of data exploration surfaced numerous data quality issues, motivating you to establish a rigorous approach to dealing with the garbage in, garbage out problem in your machine learning project. Next, you learned about the VACUUM principles for data quality, along with several case studies illustrating the real-world relevance of the principles. Finally, you applied VACUUM to the DC taxi data set to "clean" it, preparing a data set of sufficient quality to proceed with sampling for machine learning.
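Before diving into the sections that follow, it helps to see why the choice of sample size matters at all. The sketch below is not taken from the chapter's code; it uses a synthetic "fare" population (an assumption, standing in for the DC taxi data) to show the statistical effect the chapter builds on: as the sample size n grows, the means of repeated samples cluster more tightly around the population mean, with a spread that shrinks roughly in proportion to 1/sqrt(n).

```python
import random
import statistics

random.seed(0)

# Synthetic stand-in for the taxi fare column (assumption: uniform
# fares between $3.25 and $180; the real chapter uses the DC taxi data).
population = [random.uniform(3.25, 180.0) for _ in range(100_000)]
pop_mean = statistics.mean(population)

# For each candidate sample size, draw 30 independent samples and
# measure how widely their means spread around the population mean.
spreads = {}
for n in (100, 1_000, 10_000):
    sample_means = [
        statistics.mean(random.sample(population, n)) for _ in range(30)
    ]
    spreads[n] = statistics.stdev(sample_means)
    print(f"n={n:>6}: spread of sample means ~ {spreads[n]:.3f}")
```

Each tenfold increase in n shrinks the spread by roughly a factor of sqrt(10), which is the trade-off behind choosing a test-set size: larger samples give more stable summary statistics, but cost more to store and process. The chapter makes this comparison concrete using the cleaned-up taxi data.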

4.1 Getting started with data sampling

4.1.1 Exploring the summary statistics of the cleaned-up data set

4.1.2 Choosing the right sample size for the test data set

4.1.3 Exploring the statistics of alternative sample sizes

4.1.4 Using a PySpark job to sample the test set

Summary