4 Scaling with the compute layer


This chapter covers:

  • Designing scalable infrastructure that allows data scientists to handle computationally demanding projects.
  • Choosing a cloud-based compute layer that matches your needs.
  • Configuring and using compute layers in Metaflow.
  • Developing robust workflows that handle failures gracefully.

What are the most fundamental building blocks of all data science projects? First, by definition, data science projects use data. Huge amounts of data aren’t always involved, but arguably you can’t do machine learning or data science without any data. Second, the science part of data science implies that we don’t merely collect data but use it for something: we compute something using the data.

Correspondingly, data and compute are the two most foundational layers of our data science infrastructure stack, depicted in Figure 4.1.

Figure 4.1. Data science infrastructure stack with the compute layer highlighted

How to manage and access data is such a deep and broad topic that we postpone an in-depth discussion of it until Chapter 7. In this chapter, we focus on the compute layer of the stack, which answers a seemingly simple question: after a data scientist has defined a piece of code, such as a step in a workflow, where should we execute it?

4.1 What is scalability?

4.1.1 Culture of experimentation

Minimize interference to maximize scalability

4.2 The compute layer

4.2.1 Batch processing with containers

Why do containers matter?

From a container to a scalable compute layer

4.2.2 Examples of compute layers


AWS Batch

AWS Lambda

Apache Spark

Distributed training platforms

Local processes


4.3 The compute layer in Metaflow

4.3.1 Configuring AWS Batch for Metaflow

Choosing the Compute Environment

Configuring the container

The first run with AWS Batch

4.3.2 @batch and @resources decorators

Specifying resource requirements in the code

4.4 Handling failures

4.4.1 Recovering from transient errors with @retry

Avoiding retries selectively

4.4.2 Killing zombies with @timeout

4.4.3 The decorator of the last resort: @catch

Summary: Hardening a workflow gradually

4.5 Summary