8 Alternatives to Pandas


This chapter covers

  • Parallel machine learning training with Dask
  • Exploratory data analysis and processing with PySpark
  • Training machine learning models with PySpark
  • Other alternatives to pandas

In the last two chapters, we discussed scaling Python code in terms of both computational speed and memory challenges. However, many of the examples in chapters 6 and 7 still rely largely on standard data science packages like pandas and scikit-learn. In this chapter, we take scaling to the next level by learning about packages that can distribute many pandas or scikit-learn tasks across either a single machine or a cluster of machines. This enables computational speed improvements through parallelization, as well as the ability to process very large datasets by scaling the number of machines to match the size of your data. The two primary packages we’ll focus on are Dask and PySpark; we will also touch on a few others near the end of the chapter. There are several reasons to consider moving beyond standard pandas or scikit-learn:

8.1 Dask

8.1.1 Exploratory data analysis with Dask

8.1.2 Creating new features and random sampling with Dask

8.1.3 Training a model with Dask locally

8.1.4 Training a machine learning model on a remote Dask cluster

8.1.5 Summarizing Dask

8.2 PySpark

8.2.1 Setting up PySpark

8.2.2 Reading in data with PySpark

8.2.3 Exploratory data analysis with PySpark

8.2.4 Creating new columns with PySpark

8.2.5 Randomly sampling large datasets with PySpark

8.2.6 Training a machine learning model with PySpark

8.3 Other alternatives to pandas

8.3.1 Using Modin to parallelize pandas

8.3.2 Ray, Polars, and beyond

8.4 Practice on your own

8.5 Summary