10 Machine learning with Dask-ML

This chapter covers

Building machine learning models using the Dask-ML API
Using the Dask-ML API to extend scikit-learn
Validating models and tuning hyperparameters using cross-validated grid search
Using serialization to save and publish trained models

A common admission by data scientists is that the 80/20 rule definitely applies to data science: roughly 80% of the time spent on a data science project goes to preparing data for machine learning, and the other 20% goes to actually building and testing the machine learning models. This book is no exception! By now, we’ve been through the gathering, cleaning, and exploration process for two different datasets in two different “flavors”: using DataFrames and using Bags. It’s now time to move on and build some machine learning models of our own! For a point of reference, figure 10.1 shows how we’re progressing through our workflow. We’ve almost arrived at the end!

Figure 10.1 Having thoroughly covered data preparation, it’s time to move on to model building.

In this chapter, we’ll have a look at the last major API of Dask: Dask-ML. Just as Dask DataFrames parallelize Pandas and Dask Arrays parallelize NumPy, Dask-ML parallelizes scikit-learn: it offers estimators and utilities that follow the familiar scikit-learn interface but operate on Dask collections. Figure 10.2 shows the relationship between the Dask APIs and the underlying functionality they provide.

Figure 10.2 A review of the API components of Dask

10.1 Building linear models with Dask-ML

10.1.1 Preparing the data with binary vectorization

10.1.2 Building a logistic regression model with Dask-ML

10.2 Evaluating and tuning Dask-ML models

10.2.1 Evaluating Dask-ML models with the score method

10.2.2 Building a naïve Bayes classifier with Dask-ML

10.2.3 Automatically tuning hyperparameters

10.3 Persisting Dask-ML models

Summary