10 Machine learning with Dask-ML
This chapter covers
- Building machine learning models using the Dask-ML API
- Using the Dask-ML API to extend scikit-learn
- Validating models and tuning hyperparameters using cross-validated grid search
- Using serialization to save and publish trained models
Data scientists commonly admit that the 80/20 rule applies to data science: roughly 80% of the time spent on a data science project goes to preparing data for machine learning, and the other 20% goes to actually building and testing the models. This book is no exception! By now, we’ve been through the gathering, cleaning, and exploration process for two different datasets in two different “flavors”: using DataFrames and using Bags. It’s now time to move on and build some machine learning models of our own! For a point of reference, figure 10.1 shows how we’re progressing through our workflow. We’ve almost arrived at the end!
Figure 10.1 Having thoroughly covered data preparation, it’s time to move on to model building.
In this chapter, we’ll have a look at the last major API of Dask: Dask-ML. Just as we’ve seen how Dask DataFrames parallelize Pandas and Dask Arrays parallelize NumPy, Dask-ML parallelizes scikit-learn, extending it rather than replacing it. Figure 10.2 shows the relationship between the Dask APIs and the underlying functionality they provide.
Figure 10.2 A review of the API components of Dask
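As a quick reminder of what "parallelize" means in this context, here is a minimal sketch of the Dask Array and NumPy relationship mentioned above. The array size and chunk size are arbitrary values chosen for illustration, not anything prescribed by Dask. Dask-ML follows the same idea for scikit-learn: a familiar interface on top of data split into pieces that can be processed in parallel.

```python
import numpy as np
import dask.array as da

# A plain NumPy array: one contiguous block held in memory.
x_np = np.arange(1_000_000, dtype="float64")

# The same data as a Dask Array, split into chunks of 250,000 elements
# (an arbitrary chunk size picked for this illustration).
x_da = da.from_array(x_np, chunks=250_000)

# NumPy computes the mean eagerly.
mean_np = x_np.mean()

# Dask builds a lazy task graph with the same method name;
# nothing runs until compute() is called.
mean_da = x_da.mean().compute()

print(x_da.numblocks)                # number of chunks along each axis
print(np.isclose(mean_np, mean_da))  # both approaches agree on the result
```

The method call is the same in both cases; the only differences are the chunking step and the explicit `compute()`.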