2 Getting started with the data set

This chapter covers

Introducing a use case for machine learning
Starting with object storage for serverless machine learning
Using crawlers to automatically discover structured data schemas
Migrating to column-oriented data storage for more efficient analytics
Experimenting with PySpark extract-transform-load (ETL) jobs

In the previous chapter, you learned about serverless machine learning platforms and some of the reasons they can help you build a successful machine learning system. In this chapter, you will get started with a pragmatic, real-world use case for a serverless machine learning platform. You will download a data set of a few years’ worth of taxi rides from Washington, DC, that you will use to build a machine learning model for the use case. As you get familiar with the data set and learn the steps for using it to build a machine learning model, you will be introduced to the key technologies that are part of a serverless machine learning platform, including object storage, data crawlers, metadata catalogs, and distributed data processing (extract-transform-load) services. By the end of the chapter, you will also have seen code and shell command examples that illustrate how these technologies can be used with Amazon Web Services (AWS), so that you can apply what you learned in your own AWS account.

2.1 Introducing the Washington, DC, taxi rides data set

2.1.1 What is the business use case?

2.1.2 What are the business rules?

2.1.3 What is the schema for the business service?

2.1.4 What are the options for implementing the business service?

2.1.5 What data assets are available for the business service?

2.1.6 Downloading and unzipping the data set

2.2 Starting with object storage for the data set

2.2.1 Understanding object storage vs. filesystems

2.2.2 Authenticating with Amazon Web Services

2.2.3 Creating a serverless object storage bucket

2.3 Discovering the schema for the data set

2.3.1 Introducing AWS Glue

2.3.2 Authorizing the crawler to access your objects