This chapter covers:
- Using Jupyter notebooks with AWS SageMaker
- Analyzing summary statistics of the DC taxi dataset
- Evaluating alternative dataset sizes for machine learning
- Using statistical measures to choose the right machine learning dataset size
- Implementing dataset sampling in a PySpark job
In the last chapter, you began the analysis of the DC taxi fare dataset. After converting the dataset to the analysis-friendly Apache Parquet format, you crawled its data schema and used the Athena interactive querying service to explore the data. These first steps of data exploration surfaced numerous data quality issues, motivating you to establish a rigorous approach to the "garbage in, garbage out" problem in your machine learning project. Next, you learned about the VACUUM (Valid, Accurate, Consistent, Uniform, and Unified Model) principles for data quality, along with several case studies illustrating their real-world relevance. Finally, you applied the VACUUM principles to "clean" the DC taxi dataset, preparing a dataset of sufficient quality to proceed with sampling for machine learning.