This chapter covers:
- Using Jupyter notebooks with AWS SageMaker
- Analyzing summary statistics of the DC taxi dataset
- Evaluating alternative dataset sizes for machine learning
- Using statistical measures to choose the right machine learning dataset size
- Implementing dataset sampling in a PySpark job
In the last chapter, you began the analysis of the DC taxi fare dataset. After converting the dataset to the analysis-friendly Apache Parquet format, you crawled its data schema and used the Athena interactive querying service to explore the data. These first steps of data exploration surfaced numerous data quality issues, motivating you to establish a rigorous approach to the "garbage in, garbage out" problem in your machine learning project. Next, you learned about the VACUUM (Valid, Accurate, Consistent, Uniform, and Unified Model) principles for data quality, along with several case studies illustrating their real-world relevance. Finally, you applied the VACUUM principles to "clean" the DC taxi dataset, preparing a dataset of sufficient quality to proceed with sampling for machine learning.