Chapter 9. Scaling machine-learning workflows


This chapter covers

  • Determining when to scale up workflows for model accuracy and prediction throughput
  • Avoiding unnecessary investments in complex scaling strategies and heavy infrastructure
  • Ways to scale linear ML algorithms to large amounts of training data
  • Approaches to scaling nonlinear ML algorithms—usually a much greater challenge
  • Decreasing latency and increasing throughput of predictions

In real-world machine-learning applications, scalability is often a primary concern. Many ML-based systems must crunch new data and produce predictions quickly, because a prediction can become useless within milliseconds (think of real-time applications such as stock trading or clickstream analysis). Other machine-learning applications need to scale during model training, learning from gigabytes or terabytes of data (think of learning a model from an internet-scale image corpus).

In previous chapters, you worked mostly with data small enough to fit, process, and model on a single machine. For many real-world problems this is sufficient, but plenty of applications require scaling out to multiple machines, and sometimes to hundreds of machines in the cloud. This chapter is about choosing a scaling strategy and learning about the technologies involved.

9.1. Before scaling up

9.2. Scaling ML modeling pipelines

9.3. Scaling predictions

9.4. Summary

9.5. Terms from this chapter



big data: A broad term for data management and processing problems that don't fit on a single machine.
horizontal/vertical scaling: Scaling out horizontally means adding more machines to handle more data. Scaling up vertically means upgrading the hardware of your existing machines.
Hadoop, HDFS, MapReduce, Mahout: The Hadoop ecosystem is widely used in science and industry for handling and processing large amounts of data. HDFS is the distributed storage system, MapReduce is the parallel processing framework, and Mahout is the machine-learning component of the ecosystem.
Apache Spark, MLlib: Apache Spark is a newer project that keeps data in memory where possible, making it much more efficient than disk-based Hadoop MapReduce. MLlib is the machine-learning library that ships with Spark.
data locality: Performing computation where the data resides. Data transfer is often the bottleneck in big-data projects, so avoiding it can yield large savings in resource requirements.
polynomial features: A trick for extending linear models with nonlinear polynomial feature-interaction terms without losing the scalability of linear learning algorithms.
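As a minimal sketch of this trick, the snippet below expands a two-column dataset with degree-2 polynomial terms so that an ordinary linear learner can fit nonlinear interactions. It uses scikit-learn's PolynomialFeatures; the library choice is an assumption, not something the chapter prescribes.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree=2 adds a bias column, squares, and pairwise products:
# for features (a, b) the output columns are 1, a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly.shape)  # 2 rows, 6 expanded columns
```

The expanded matrix can then be fed to any scalable linear algorithm; the model stays linear in the parameters even though it captures nonlinear structure in the original features.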
Vowpal Wabbit: An ML tool for building models efficiently on large datasets without necessarily using a full big-data system such as Hadoop.
out-of-core: A computation is out-of-core if it needs to keep only the current chunk of data in memory at each iteration, rather than the full dataset.
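A small sketch of out-of-core learning, assuming scikit-learn's SGDClassifier (one of several tools that support incremental fitting via partial_fit). Only one chunk of training data exists in memory at a time; here the chunks are simulated, but in practice each would be read from disk or a stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared up front

# Process the data one chunk at a time; memory use stays constant
# no matter how many chunks (i.e., how much total data) we stream.
for _ in range(10):
    X_chunk = rng.normal(size=(100, 5))
    y_chunk = (X_chunk[:, 0] > 0).astype(int)  # synthetic labels
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```

The same pattern generalizes: any learner whose update step depends only on the current batch (like stochastic gradient descent) can train on datasets far larger than RAM.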
histogram approximations: Approximations that convert the columns of the training data to histograms before learning, shrinking the data the algorithm must process.
feature selection: The process of reducing the size of the training data by selecting and retaining only the best (most predictive) subset of features.
Lasso: A linear algorithm that selects the most predictive subset of features by driving the coefficients of the rest to zero. Very useful for feature selection.
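The following sketch shows Lasso-based feature selection with scikit-learn (an assumed library choice, and the data is synthetic): the L1 penalty zeroes out coefficients of uninformative features, so the surviving nonzero coefficients identify the predictive subset.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two of the ten features actually drive the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L1 penalty: larger alpha -> fewer
# surviving features. 0.1 is an illustrative value, not a recommendation.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(selected)  # indices of features with nonzero coefficients
```

Training a downstream model only on the selected columns shrinks the dataset, which is exactly the scaling benefit the chapter describes.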
deep neural nets: An evolution of neural networks that scales to larger datasets and achieves state-of-the-art accuracy. In practice they require more expertise and computational resources than other algorithms, depending on the dataset and problem at hand.
prediction volume/velocity: Scaling prediction volume means handling a large number of predictions; scaling velocity means making each prediction fast enough for a specific real-time use case.
accuracy vs. speed: For real-time predictions, you can sometimes trade prediction accuracy for the speed with which predictions are made.
Spark Streaming, Apache Storm, Apache Kafka, AWS Kinesis: Upcoming technologies for building real-time streaming systems.