5 Practicing scalability and performance

 

This chapter covers

  • Developing a realistic, performant data science project iteratively
  • Using the compute layer to power demanding operations, such as parallelized model training (sketched after this list)
  • Optimizing the performance of numerical Python code
  • Using various techniques to make your workflows more scalable and performant
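To preview the second bullet above, the sketch below shows the fan-out/fan-in pattern behind parallelized model training, run locally with Python's standard library and scikit-learn. It is a minimal sketch rather than an example from this chapter; the function name train_one and the toy candidate grid are hypothetical. Section 5.2 covers how the compute layer applies the same pattern beyond a single machine.

# A minimal sketch (not from the book): fan training runs out across local
# CPU cores, then fan the results back in. A compute layer applies the same
# pattern across many machines instead of local processes.
from concurrent.futures import ProcessPoolExecutor

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical hyperparameter candidates to train in parallel.
CANDIDATES = [10, 50, 100, 200]

def train_one(n_estimators):
    """Train one model variant and return (parameter, mean CV accuracy)."""
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    score = cross_val_score(model, X, y, cv=3).mean()
    return n_estimators, score

if __name__ == "__main__":
    # Fan out: each candidate trains in its own process.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(train_one, CANDIDATES))
    # Fan in: keep the best-scoring candidate.
    best = max(results, key=lambda r: r[1])
    print(f"best n_estimators={best[0]}, accuracy={best[1]:.3f}")

The map step (train each candidate independently) and the reduce step (pick the best result) do not depend on where the training runs, which is what makes the pattern straightforward to scale up on a bigger machine or out across many machines.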

In the previous chapter, we discussed how scalability is not only about handling more demanding algorithms or more data. At the organizational level, the infrastructure should also scale to a large number of projects developed by a large number of people. We recognized that scalability and performance are separate concerns: you can have one without the other. In fact, the different dimensions of scalability and performance can be at odds with each other.

Imagine an experienced engineer implementing a highly optimized, high-performance solution in C++. Although the solution scales at the technical level, it is not very scalable organizationally if no one else on the team knows C++. Conversely, imagine a very high-level ML solution that builds models at the click of a button. Everyone knows how to click the button, but the solution is too inflexible to scale to a wide variety of projects and is unable to handle large amounts of data. This chapter advocates a pragmatic approach to scalability and performance that features the following points:

5.1 Starting simple: Vertical scalability

5.1.1 Example: Clustering Yelp reviews

5.1.2 Practicing vertical scalability

5.1.3 Why vertical scalability?

5.2 Practicing horizontal scalability

5.2.1 Why horizontal scalability?

5.2.2 Example: Hyperparameter search

5.3 Practicing performance optimization

5.3.1 Example: Computing a co-occurrence matrix

5.3.2 Recipe for fast-enough workflows