
5 Practicing Scalability and Performance

 

This chapter covers:

  • Developing a realistic, performant data science project iteratively.
  • Using the compute layer to power demanding operations, such as parallelized model training (sketched briefly after this list).
  • Optimizing the performance of numerical Python code.
  • Applying various techniques to make your workflows more scalable and performant.
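
To make the compute-layer bullet concrete, here is a minimal sketch, not taken from the chapter's own examples, of what parallelized model training can look like. It assumes a Metaflow flow (the framework behind the @conda decorator listed under section 5.1.2); the flow name, the hyperparameter values, and the placeholder scoring metric are all hypothetical.

# A hypothetical Metaflow flow that fans out model training over a foreach,
# so each branch can run as an independent task on the compute layer.
from metaflow import FlowSpec, step, resources

class ParallelTrainFlow(FlowSpec):

    @step
    def start(self):
        # Hypothetical hyperparameter values: one training branch per value.
        self.alphas = [0.01, 0.1, 1.0]
        self.next(self.train, foreach="alphas")

    @resources(cpu=2, memory=4000)  # resource request honored by the compute layer
    @step
    def train(self):
        # self.input holds this branch's alpha; a real step would fit a model
        # here and store its metrics as artifacts.
        self.alpha = self.input
        self.score = 1.0 / (1.0 + self.alpha)  # placeholder "metric"
        self.next(self.join)

    @step
    def join(self, inputs):
        # Pick the branch with the best placeholder score.
        self.best_alpha = max(inputs, key=lambda inp: inp.score).alpha
        self.next(self.end)

    @step
    def end(self):
        print("best alpha:", self.best_alpha)

if __name__ == "__main__":
    ParallelTrainFlow()

You can run the sketch locally with python parallel_train.py run, or push the branches to a cloud compute layer with an option such as --with batch, depending on how the infrastructure is deployed.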

In the previous chapter, we discussed how scalability is not only about handling more demanding algorithms or more data. At the organizational level, the infrastructure should also scale to a large number of projects developed by a large number of people. We recognized that scalability and performance are separate concerns: you can have one without the other. In fact, the different dimensions of scalability and performance can be at odds with each other.

Imagine an experienced engineer implementing a highly optimized, high-performance solution in C++. While the solution scales at the technical level, it is not very scalable organizationally: no one else on the team knows C++. Conversely, imagine a very high-level ML solution that builds models at the click of a button: everyone knows how to click the button, but the solution is too inflexible to scale to a wide variety of projects and unable to handle large amounts of data.

This chapter advocates for a pragmatic approach to scalability and performance:

5.1 Starting simple: Vertical scalability

5.1.1 Example: Clustering Yelp reviews

One-minute primer to natural language processing

5.1.2 Practicing vertical scalability

Defining dependencies with @conda

5.1.3 Why vertical scalability

Simplicity boosts performance and productivity

Consider the nature of your problem

5.2 Practicing horizontal scalability

5.2.1 Why horizontal scalability

Embarrassingly parallel tasks

Large datasets

Distributed algorithms

5.2.2 Example: Hyperparameter search

Inspecting results

5.3 Practicing performance optimization

5.3.1 Example: Computing a co-occurrence matrix

Variant 1: A plain Python implementation

Variant 2: Leveraging a high-performance library

Variant 3: Compiling Python with Numba

Variant 4: Parallelizing the algorithm over multiple CPU cores

Summarizing the variants

5.3.2 Recipe for fast-enough workflows

5.4 Summary