Chapter 9. Scaling machine-learning workflows


This chapter covers

  • Determining when to scale up workflows for model accuracy and prediction throughput
  • Avoiding unnecessary investments in complex scaling strategies and heavy infrastructure
  • Ways to scale linear ML algorithms to large amounts of training data
  • Approaches to scaling nonlinear ML algorithms—usually a much greater challenge
  • Decreasing latency and increasing throughput of predictions

In real-world machine-learning applications, scalability is often a primary concern. Many ML-based systems must crunch new data and produce predictions quickly, because a prediction can become useless within milliseconds (think of real-time applications such as stock trading or clickstream analysis). Other machine-learning applications need to scale during model training, learning from gigabytes or terabytes of data (think of learning a model from an internet-scale image corpus).

In previous chapters, you worked mostly with data small enough to fit, process, and model on a single machine. For many real-world problems this is sufficient, but plenty of applications require scaling out to multiple machines, and sometimes to hundreds of machines in the cloud. This chapter is about choosing a scaling strategy and learning about the technologies involved.

9.1. Before scaling up

9.2. Scaling ML modeling pipelines

9.3. Scaling predictions

9.4. Summary

9.5. Terms from this chapter



big data: A broad term for data management and processing problems that don't fit on a single machine.
horizontal/vertical scaling: Scaling out horizontally means adding more machines to handle more data. Scaling up vertically means upgrading the hardware of your existing machines.
Hadoop, HDFS, MapReduce, Mahout: The Hadoop ecosystem is widely used in science and industry for handling and processing large amounts of data. HDFS is the distributed storage system, MapReduce is the parallel processing framework, and Mahout is the machine-learning component of the ecosystem.
Apache Spark, MLlib: Apache Spark is a newer project that keeps data in memory where possible, making it much more efficient than disk-based Hadoop MapReduce. MLlib is the machine-learning library that ships with Spark.
data locality: Performing computation where the data resides. Data transfer is often the bottleneck in big-data projects, so avoiding it can yield large savings in resource requirements.
polynomial features: A trick for extending linear models with nonlinear polynomial feature-interaction terms without losing the scalability of linear learning algorithms.
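As a minimal sketch of this trick, the snippet below expands a two-column dataset with degree-2 polynomial terms so that an ordinary linear learner can fit nonlinear interactions. It uses scikit-learn's PolynomialFeatures; the library choice is an assumption, not something the chapter prescribes.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree=2 adds a bias column, squares, and pairwise products:
# for features (a, b) the output columns are 1, a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly.shape)  # 2 rows, 6 expanded columns
```

The expanded matrix can then be fed to any scalable linear algorithm; the model stays linear in the parameters even though it captures nonlinear structure in the original features.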
Vowpal Wabbit: An ML tool for building models efficiently on large datasets without necessarily using a full big-data system such as Hadoop.
out-of-core: A computation is out-of-core if it needs to keep only the current chunk of data in memory at each iteration, rather than the full dataset.
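A small sketch of out-of-core learning, assuming scikit-learn's SGDClassifier (one of several tools that support incremental fitting via partial_fit). Only one chunk of training data exists in memory at a time; here the chunks are simulated, but in practice each would be read from disk or a stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared up front

# Process the data one chunk at a time; memory use stays constant
# no matter how many chunks (i.e., how much total data) we stream.
for _ in range(10):
    X_chunk = rng.normal(size=(100, 5))
    y_chunk = (X_chunk[:, 0] > 0).astype(int)  # synthetic labels
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```

The same pattern generalizes: any learner whose update step depends only on the current batch (like stochastic gradient descent) can train on datasets far larger than RAM.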
histogram approximations: Approximations that convert the columns of the training data to histograms before learning, shrinking the data the algorithm must process.
feature selection: The process of reducing the size of the training data by selecting and retaining only the best (most predictive) subset of features.
Lasso: A linear algorithm that selects the most predictive subset of features by driving the coefficients of the rest to zero. Very useful for feature selection.
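The following sketch shows Lasso-based feature selection with scikit-learn (an assumed library choice, and the data is synthetic): the L1 penalty zeroes out coefficients of uninformative features, so the surviving nonzero coefficients identify the predictive subset.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two of the ten features actually drive the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L1 penalty: larger alpha -> fewer
# surviving features. 0.1 is an illustrative value, not a recommendation.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(selected)  # indices of features with nonzero coefficients
```

Training a downstream model only on the selected columns shrinks the dataset, which is exactly the scaling benefit the chapter describes.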
deep neural nets: An evolution of neural networks that scales to larger datasets and achieves state-of-the-art accuracy. In practice they require more expertise and computational resources than other algorithms, depending on the dataset and problem at hand.
prediction volume/velocity: Scaling prediction volume means handling a large number of predictions; scaling velocity means making each prediction fast enough for a specific real-time use case.
accuracy vs. speed: For real-time predictions, you can sometimes trade prediction accuracy for the speed with which predictions are made.
Spark Streaming, Apache Storm, Apache Kafka, AWS Kinesis: Upcoming technologies for building real-time streaming systems.