9 Working with Bags and Arrays

 

This chapter covers

  • Reading, transforming, and analyzing unstructured data using Bags
  • Creating Arrays and DataFrames from Bags
  • Extracting and filtering data from Bags
  • Combining and grouping elements of Bags using fold and reduce functions
  • Using NLTK (Natural Language Toolkit) with Bags for text mining on large text datasets

The majority of this book focuses on using DataFrames for analyzing structured data, but our exploration of Dask would not be complete without covering the two other high-level Dask APIs: Bags and Arrays. When your data doesn’t fit neatly in a tabular model, Bags and Arrays offer additional flexibility. DataFrames are limited to two dimensions (rows and columns), but Arrays can have many more. The Array API also offers additional functionality for certain linear algebra, advanced mathematics, and statistics operations. However, much of what’s been covered already through working with DataFrames also applies to working with Arrays, just as Pandas and NumPy have many similarities. In fact, you might recall from chapter 1 that Dask DataFrames are parallelized Pandas DataFrames and Dask Arrays are parallelized NumPy arrays.
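To make the contrast concrete, here is a minimal sketch of both APIs. The records and variable names are invented for illustration and are not taken from the chapter's dataset. It assumes Dask is installed with its Bag and Array components. The Bag computation passes `scheduler="synchronous"` so that it runs in the calling process, which keeps a small example simple.

```python
import dask.array as da
import dask.bag as db

# A Bag holds arbitrary Python objects. These made-up records deliberately
# do not share a fixed set of fields, so they would not fit a tabular model.
records = [
    {"user": "a", "text": "great product"},
    {"user": "b"},
    "a stray line of raw text",
]
bag = db.from_sequence(records, npartitions=2)

# Keep only the dict-shaped records that carry a "text" field.
with_text = bag.filter(lambda r: isinstance(r, dict) and "text" in r)

# Run in the calling process; no worker processes are started.
n_with_text = with_text.count().compute(scheduler="synchronous")

# A Dask Array can have more than two dimensions. This one has three and is
# split into two chunks along the first axis.
cube = da.ones((4, 3, 2), chunks=(2, 3, 2))
cube_total = float(cube.sum().compute())

print(n_with_text, cube.ndim, cube_total)
```

As with DataFrames, nothing is computed until `compute` is called. The `filter`, `count`, and `sum` calls only build up a task graph.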

9.1 Reading and parsing unstructured data with Bags

 
 
 

9.1.1 Selecting and viewing data from a Bag

 
 
 

9.1.2 Common parsing issues and how to overcome them

 

9.1.3 Working with delimiters

 
 
 

9.2 Transforming, filtering, and folding elements

 
 
 
 

9.2.1 Transforming elements with the map method

 
 
 

9.2.2 Filtering Bags with the filter method

 
 

9.2.3 Calculating descriptive statistics on Bags

 
 

9.2.4 Creating aggregate functions using the foldby method

 

9.3 Building Arrays and DataFrames from Bags

 
 

9.4 Using Bags for parallel text analysis with NLTK

 

9.4.1 The basics of bigram analysis

 
 

9.4.2 Extracting tokens and filtering stopwords

 
 

9.4.3 Analyzing the bigrams

 
 
 

Summary

 