In the previous chapter, you imported the DC taxi data set into AWS and stored it in your project's S3 object storage bucket. You created, configured, and ran an AWS Glue data catalog crawler that analyzed the data set and discovered its data schema. You also learned about column-based data storage formats, such as Apache Parquet, and their advantages over row-based formats for analytical workloads. At the end of the chapter, you used a PySpark job running on AWS Glue to convert the DC taxi data set from its original row-based, comma-separated values (CSV) format to Parquet and stored the result in your S3 bucket.
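The row-based versus column-based distinction recalled above can be illustrated with a minimal, pure-Python sketch. This is not the actual Parquet format, and the column names (`trip_miles`, `fare`) and values are hypothetical stand-ins rather than the real DC taxi schema; the sketch only shows why an aggregate over one column is cheaper when values of that column are stored together.

```python
import csv
import io

# Hypothetical sample shaped loosely like taxi trip records
# (illustrative only; not the real DC taxi data set or schema).
rows = [
    {"trip_miles": "1.2", "fare": "7.5"},
    {"trip_miles": "3.4", "fare": "12.0"},
    {"trip_miles": "0.8", "fare": "5.25"},
]

# Row-based layout (CSV): each record is stored contiguously, so
# aggregating a single column still means scanning every full row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["trip_miles", "fare"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

total_fare_rowwise = sum(
    float(r["fare"]) for r in csv.DictReader(io.StringIO(csv_text))
)

# Column-based layout (the idea behind Parquet): all values of a column
# are stored together, so an aggregate touches only the column it needs.
columns = {
    "trip_miles": [float(r["trip_miles"]) for r in rows],
    "fare": [float(r["fare"]) for r in rows],
}
total_fare_columnwise = sum(columns["fare"])

print(total_fare_rowwise, total_fare_columnwise)  # both layouts yield 24.75
```

Both layouts hold the same data and produce the same answer; the difference, which columnar formats like Parquet exploit at scale, is how much data must be read to get it.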