3 Exploring and preparing the data set

This chapter covers

  • Getting started with AWS Athena for interactive querying
  • Choosing between manually specified and discovered data schemas
  • Approaching data quality with VACUUM normative principles
  • Analyzing DC taxi data quality through interactive querying
  • Implementing data quality processing in PySpark

In the previous chapter, you imported the DC taxi data set into AWS and stored it in your project’s S3 object storage bucket. You created, configured, and ran an AWS Glue data catalog crawler that analyzed the data set and discovered its data schema. You also learned about column-based data storage formats (e.g., Apache Parquet) and their advantages over row-based formats for analytical workloads. At the conclusion of the chapter, you used a PySpark job running on AWS Glue to convert the DC taxi data set from its original row-based, comma-separated values (CSV) format to Parquet and stored the result in your S3 bucket.
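As a reminder of what that conversion involved, the following is a minimal PySpark sketch that reads the CSV files from S3 and writes them back out as Parquet. It uses a plain SparkSession rather than the Glue-specific job wrapper from the previous chapter, and the bucket name and prefixes are placeholders you should replace with the S3 locations you created for your own project.

from pyspark.sql import SparkSession

# Minimal sketch of the CSV-to-Parquet conversion recapped above.
# The bucket name and prefixes are placeholders for your project's S3 locations.
spark = SparkSession.builder.appName("dctaxi-csv-to-parquet").getOrCreate()

df = (spark.read
        .option("header", "true")       # the DC taxi CSV files start with a header row
        .option("inferSchema", "true")  # let Spark guess column types for this sketch
        .csv("s3://<your-dctaxi-bucket>/csv/"))

(df.write
    .mode("overwrite")
    .parquet("s3://<your-dctaxi-bucket>/parquet/"))

The sketch infers the schema for brevity; the job in the previous chapter could instead rely on the schema discovered by the Glue data catalog crawler.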

3.1 Getting started with interactive querying

3.1.1 Choosing the right use case for interactive querying

3.1.2 Introducing AWS Athena

3.1.3 Preparing a sample data set

3.1.4 Interactive querying using Athena from a browser

3.1.5 Interactive querying using a sample data set

3.1.6 Querying the DC taxi data set

3.2 Getting started with data quality

3.2.1 From “garbage in, garbage out” to data quality

3.2.2 Before starting with data quality

3.2.3 Normative principles for data quality

3.3 Applying VACUUM to the DC taxi data

3.3.1 Enforcing the schema to ensure valid values

3.3.2 Cleaning up invalid fare amounts

3.3.3 Improving the accuracy

3.4 Implementing VACUUM in a PySpark job