4 Scaling with the compute layer

This chapter covers

Designing scalable infrastructure that allows data scientists to handle computationally demanding projects
Choosing a cloud-based compute layer that matches your needs
Configuring and using compute layers in Metaflow
Developing robust workflows that handle failures gracefully

What are the most fundamental building blocks of all data science projects? First, by definition, data science projects use data: every machine learning and data science project needs at least a small amount of it. Second, the science part of data science implies that we don’t merely collect data but use it for something; that is, we compute something using data. Correspondingly, data and compute are the two most foundational layers of our data science infrastructure stack, depicted in figure 4.1.

Figure 4.1 Data science infrastructure stack with the compute layer highlighted

Managing and accessing data is such a deep and broad topic that we postpone an in-depth discussion of it until chapter 7. In this chapter, we focus on the compute layer of the stack, which answers a seemingly simple question: after a data scientist has defined a piece of code, such as a step in a workflow, where should we execute it?

4.1 What is scalability?

4.1.1 Scalability across the stack

4.1.2 Culture of experimentation

4.2 The compute layer

4.2.1 Batch processing with containers

4.2.2 Examples of compute layers

4.3 The compute layer in Metaflow

4.3.1 Configuring AWS Batch for Metaflow

4.3.2 @batch and @resources decorators

4.4 Handling failures

4.4.1 Recovering from transient errors with @retry

4.4.2 Killing zombies with @timeout

Summary