3 Exploring and preparing the data set

This chapter covers

  • Getting started with AWS Athena for interactive querying
  • Choosing between manually specified and discovered data schemas
  • Approaching data quality with VACUUM normative principles
  • Analyzing DC taxi data quality through interactive querying
  • Implementing data quality processing in PySpark

In the previous chapter, you imported the DC taxi data set into AWS and stored it in your project’s S3 object storage bucket. You created, configured, and ran an AWS Glue data catalog crawler that analyzed the data set and discovered its data schema. You also learned about column-based data storage formats (e.g., Apache Parquet) and their advantages over row-based formats for analytical workloads. At the conclusion of the chapter, you used a PySpark job running on AWS Glue to convert the DC taxi data set from its original row-based, comma-separated values (CSV) format to Parquet and stored the result in your S3 bucket.
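As a reminder of what that conversion involved, the following is a minimal PySpark sketch that reads the CSV files from S3 and writes them back out as Parquet. It uses a plain SparkSession rather than the Glue-specific job wrapper from the previous chapter, and the bucket name and prefixes are placeholders you should replace with the S3 locations you created for your own project.

from pyspark.sql import SparkSession

# Minimal sketch of the CSV-to-Parquet conversion recapped above.
# The bucket name and prefixes are placeholders for your project's S3 locations.
spark = SparkSession.builder.appName("dctaxi-csv-to-parquet").getOrCreate()

df = (spark.read
        .option("header", "true")       # the DC taxi CSV files start with a header row
        .option("inferSchema", "true")  # let Spark guess column types for this sketch
        .csv("s3://<your-dctaxi-bucket>/csv/"))

(df.write
    .mode("overwrite")
    .parquet("s3://<your-dctaxi-bucket>/parquet/"))

The sketch infers the schema for brevity; the job in the previous chapter could instead rely on the schema discovered by the Glue data catalog crawler.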

3.1 Getting started with interactive querying

3.1.1 Choosing the right use case for interactive querying

3.1.2 Introducing AWS Athena

3.1.3 Preparing a sample data set

3.1.4 Interactive querying using Athena from a browser

3.1.5 Interactive querying using a sample data set

3.1.6 Querying the DC taxi data set

3.2 Getting started with data quality

3.2.1 From “garbage in, garbage out” to data quality

3.2.2 Before starting with data quality

3.2.3 Normative principles for data quality

3.3 Applying VACUUM to the DC taxi data

3.3.1 Enforcing the schema to ensure valid values

3.3.2 Cleaning up invalid fare amounts

3.3.3 Improving the accuracy

3.4 Implementing VACUUM in a PySpark job