4 Scaling with the compute layer

 

This chapter covers:

  • Designing scalable infrastructure that allows data scientists to handle computationally demanding projects.
  • Choosing a cloud-based compute layer that matches your needs.
  • Configuring and using compute layers in Metaflow.
  • Developing robust workflows that handle failures gracefully.

What are the most fundamental building blocks of all data science projects? First, by definition, data science projects use data. The amount of data isn’t always huge, but arguably you can’t do machine learning or data science without any data. Second, the science part of data science implies that we don’t merely collect data but use it for something; that is, we compute something using data.

Correspondingly, data and compute are the two most foundational layers of our data science infrastructure stack, depicted in Figure 4.1.

Figure 4.1. Data science infrastructure stack with the compute layer highlighted

How to manage and access data is such a deep and broad topic that we postpone an in-depth discussion of it until Chapter 7. In this chapter, we focus on the compute layer of the stack, which answers a seemingly simple question: after a data scientist has defined a piece of code, such as a step in a workflow, where should we execute it?
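
To make the question concrete, here is a minimal sketch of a Metaflow workflow with two steps; the flow and step names are made up for illustration. The code inside each @step is exactly the kind of unit the compute layer has to place somewhere, be it a local process on a laptop or a container in the cloud.

from metaflow import FlowSpec, step

class HelloComputeFlow(FlowSpec):

    @step
    def start(self):
        # This step could run as a local process on a laptop or be shipped
        # to a cloud-based compute layer; choosing where is the job of the
        # compute layer discussed in this chapter.
        self.greeting = "hello from the compute layer"
        self.next(self.end)

    @step
    def end(self):
        print(self.greeting)

if __name__ == "__main__":
    HelloComputeFlow()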

4.1 What is scalability?

 
 

4.1.1 Culture of experimentation

 
 

Minimize interference to maximize scalability

 
 

4.2 The compute layer

 
 
 

4.2.1 Batch processing with containers

 
 
 

Why do containers matter?

 
 
 
 

From a container to a scalable compute layer

 
 

4.2.2 Examples of compute layers

 
 
 
 

Kubernetes

 

AWS Batch

 
 
 

AWS Lambda

 
 

Apache Spark

 
 

Distributed training platforms

 
 

Local processes

 
 
 

Comparison

 

4.3 The compute layer in Metaflow

 
 

4.3.1 Configuring AWS Batch for Metaflow

 
 

Choosing the compute environment

 
 
 

Configuring the container
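
As one example of container configuration that can be sketched here, the Docker image a step runs in can be overridden per step through the image argument of the @batch decorator; the image name below is made up.

from metaflow import FlowSpec, step, batch

class ImageDemoFlow(FlowSpec):

    # Run this step in a specific Docker image instead of the default
    # image configured for Metaflow (the image name is hypothetical).
    @batch(image='mycompany/datascience:1.0')
    @step
    def start(self):
        import sys
        print("Python in this container:", sys.version)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ImageDemoFlow()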

 
 
 

The first run with AWS Batch
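
Assuming AWS Batch has been configured for Metaflow as outlined above, a first remote run could look like the following sketch; the flow is deliberately trivial and its name is made up. No code changes are needed to move execution to the cloud: the --with batch option attaches the @batch decorator to every step at run time.

# Run locally:                 python hello_batch.py run
# Run every step on AWS Batch: python hello_batch.py run --with batch
from metaflow import FlowSpec, step

class HelloBatchFlow(FlowSpec):

    @step
    def start(self):
        self.message = "hello from a container"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)

if __name__ == "__main__":
    HelloBatchFlow()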

 
 
 
 

4.3.2 @batch and @resources decorators

 
 
 

Specifying resource requirements in the code
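
The sketch below shows the two decorators on hypothetical steps with arbitrary resource numbers (memory is in megabytes). @resources only declares what a step needs and takes effect when the run is executed on a compute layer, for example with run --with batch, whereas @batch forces a particular step to run on AWS Batch.

from metaflow import FlowSpec, step, resources, batch

class ResourceDemoFlow(FlowSpec):

    # @resources declares requirements; they are honored when the flow
    # runs on a compute layer, e.g. `python resource_demo.py run --with batch`.
    @resources(cpu=4, memory=16000)
    @step
    def start(self):
        self.numbers = list(range(1_000_000))
        self.next(self.heavy)

    # @batch sends this particular step to AWS Batch regardless of how
    # the rest of the flow is executed.
    @batch(cpu=8, memory=32000)
    @step
    def heavy(self):
        self.total = sum(self.numbers)
        self.next(self.end)

    @step
    def end(self):
        print("total:", self.total)

if __name__ == "__main__":
    ResourceDemoFlow()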

 
 

4.4 Handling failures

 

4.4.1 Recovering from transient errors with @retry
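
As a sketch of the decorator in action, the step below simulates a transient failure (the failure rate is made up); @retry reruns the step automatically instead of failing the whole run. The same behavior can be attached to all steps at once with run --with retry.

import random

from metaflow import FlowSpec, step, retry

class RetryDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.flaky)

    # Rerun this step up to three more times if it raises an exception,
    # which rides over transient problems such as a lost container.
    @retry(times=3)
    @step
    def flaky(self):
        if random.random() < 0.5:  # simulated transient error
            raise RuntimeError("transient failure, please retry")
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    RetryDemoFlow()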

 

Avoiding retries selectively
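
Not every step is safe to rerun; a step with an external side effect, such as publishing results to another system, should not execute twice. A sketch of the pattern (step names are illustrative): keep retries on idempotent steps and set times=0 on the ones that must run at most once, so they are skipped even when retries are enabled globally with --with retry.

from metaflow import FlowSpec, step, retry

class SelectiveRetryFlow(FlowSpec):

    @step
    def start(self):
        self.rows = [1, 2, 3]
        self.next(self.train)

    # Idempotent: safe to retry.
    @retry(times=2)
    @step
    def train(self):
        self.model = sum(self.rows)
        self.next(self.publish)

    # Has an external side effect, so never retry it, not even
    # when the run is started with `--with retry`.
    @retry(times=0)
    @step
    def publish(self):
        print("publishing model:", self.model)  # imagine a real side effect
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    SelectiveRetryFlow()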

 
 
 

4.4.2 Killing zombies with @timeout
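
A sketch of @timeout with arbitrary limits: if the step runs longer than the given time, it is interrupted and treated as a failure, which @retry can then pick up instead of leaving a zombie task running forever.

import time

from metaflow import FlowSpec, step, retry, timeout

class TimeoutDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.maybe_stuck)

    # Interrupt the step if it takes longer than five minutes; combined
    # with @retry, a hung attempt is killed and tried again.
    @retry(times=2)
    @timeout(minutes=5)
    @step
    def maybe_stuck(self):
        time.sleep(10)  # stand-in for work that could hang indefinitely
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    TimeoutDemoFlow()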

 
 
 

4.4.3 The decorator of the last resort: @catch
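
A sketch of @catch (names are illustrative): if the step still fails after all retries, the exception is caught and stored in the artifact named by var, and the flow is allowed to continue, so a later step can decide what to do about the failure.

from metaflow import FlowSpec, step, catch, retry

class CatchDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.risky)

    # If the step fails even after retries, store the exception in
    # self.risky_failed and continue with the rest of the flow.
    @catch(var='risky_failed')
    @retry(times=2)
    @step
    def risky(self):
        # This sketch fails unconditionally to show the mechanism.
        raise ValueError("something went irrecoverably wrong")
        self.next(self.end)

    @step
    def end(self):
        if self.risky_failed:
            print("risky step failed:", self.risky_failed)
        else:
            print("risky step succeeded")

if __name__ == "__main__":
    CatchDemoFlow()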

 

Summary: Hardening a workflow gradually
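
Putting the pieces together, a single compute-heavy step can be hardened gradually by stacking the decorators; the values and names in the sketch below are arbitrary.

from metaflow import FlowSpec, step, resources, retry, timeout, catch

class HardenedFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.crunch)

    # Layers of protection, roughly from the inside out:
    #   @resources - ask the compute layer for enough CPU and memory
    #   @timeout   - kill the attempt if it hangs
    #   @retry     - rerun it a few times on transient errors
    #   @catch     - as a last resort, record the failure and move on
    @catch(var='crunch_failed')
    @retry(times=3)
    @timeout(hours=1)
    @resources(cpu=4, memory=16000)
    @step
    def crunch(self):
        self.result = sum(range(10_000_000))
        self.next(self.end)

    @step
    def end(self):
        if self.crunch_failed:
            print("crunch failed:", self.crunch_failed)
        else:
            print("result:", self.result)

if __name__ == "__main__":
    HardenedFlow()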

 
 
 

4.5 Summary

 
 