9 Working with Bags and Arrays
This chapter covers
- Reading, transforming, and analyzing unstructured data using Bags
- Creating Arrays and DataFrames from Bags
- Extracting and filtering data from Bags
- Combining and grouping elements of Bags using fold and reduce functions
- Using NLTK (Natural Language Toolkit) with Bags for text mining on large text datasets
Most of this book focuses on using DataFrames to analyze structured data, but our exploration of Dask would not be complete without covering the two other high-level Dask APIs: Bags and Arrays. When your data doesn’t fit neatly into a tabular model, Bags and Arrays offer additional flexibility. DataFrames are limited to two dimensions (rows and columns), but Arrays can have many more. The Array API also offers additional functionality for certain linear algebra, advanced mathematics, and statistics operations. However, much of what’s been covered already through working with DataFrames also applies to working with Arrays, just as Pandas and NumPy have many similarities. In fact, you might recall from chapter 1 that Dask DataFrames are parallelized Pandas DataFrames and Dask Arrays are parallelized NumPy arrays.
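To make the contrast concrete, here is a minimal sketch of both APIs. It is an illustration only: the variable names and the tiny sample data are hypothetical and are not taken from the datasets used in this book. It shows a three-dimensional Dask Array built on top of a NumPy array, and a Bag holding free-text records that would not fit a tabular model.

```python
import numpy as np
import dask.array as da
import dask.bag as db

# Hypothetical sample data: a small 3-D NumPy array (2 x 3 x 4).
cube = np.arange(24).reshape(2, 3, 4)

# Wrap it as a Dask Array split into two chunks along the first axis.
darr = da.from_array(cube, chunks=(1, 3, 4))

# The Array API mirrors NumPy; nothing runs until compute() is called.
plane_means = darr.mean(axis=0).compute()

# A Bag holds arbitrary Python objects, here short free-text records.
reviews = db.from_sequence(
    ["great coffee", "slow service", "great staff"],
    npartitions=2,
)

# Filter and count lazily, then compute. The threaded scheduler is
# chosen here only to keep the sketch free of multiprocessing setup.
great_count = (
    reviews.filter(lambda text: "great" in text)
    .count()
    .compute(scheduler="threads")
)
```

Both objects are lazy in the same way Dask DataFrames are: the calls build a task graph, and the work happens only when compute is invoked.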