15. Aggregating your data


This chapter covers

  • Refreshing your knowledge of aggregations
  • Performing basic aggregations
  • Using live data to perform aggregations
  • Building custom aggregations

Aggregating is a way to group data so you can view it at a macro level rather than at an atomic, or micro, level; for example, you might sum attendance per school instead of reading each daily record. Aggregations are an essential step toward better analytics and, further down the road, toward machine learning and artificial intelligence.

In this chapter, you will start slowly, with a short reminder of what aggregations are. Then you'll perform basic aggregations with Spark, using both Spark SQL and the dataframe API.
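
To make the two styles concrete before diving in, here is a minimal sketch using a tiny invented dataset of per-school student counts. The class, column names, and values are all hypothetical illustrations, not the chapter's lab data:

import static org.apache.spark.sql.functions.sum;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BasicAggregationSketchApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Basic aggregation sketch")
        .master("local[*]")
        .getOrCreate();

    // A tiny in-memory dataset: one row per (school, students) pair
    Dataset<Row> df = spark.createDataFrame(Arrays.asList(
        new Enrollment("A", 320),
        new Enrollment("A", 450),
        new Enrollment("B", 500)), Enrollment.class);

    // Dataframe API: group by school and sum the student counts
    df.groupBy("school")
        .agg(sum("students").as("totalStudents"))
        .show();

    // Spark SQL: the same aggregation expressed as a query
    df.createOrReplaceTempView("enrollment");
    spark.sql("SELECT school, SUM(students) AS totalStudents "
        + "FROM enrollment GROUP BY school")
        .show();
  }

  // JavaBean backing the in-memory dataframe
  public static class Enrollment implements java.io.Serializable {
    private String school;
    private int students;

    public Enrollment() {}

    public Enrollment(String school, int students) {
      this.school = school;
      this.students = students;
    }

    public String getSchool() { return school; }
    public void setSchool(String school) { this.school = school; }
    public int getStudents() { return students; }
    public void setStudents(int students) { this.students = students; }
  }
}

Equivalent queries in either style compile to the same optimized plan through Spark's Catalyst optimizer, so the choice between Spark SQL and the dataframe API is largely a matter of taste and context.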

Once you have gone through the basics, you will analyze open data from New York City public schools, studying attendance, absenteeism, and more through aggregations. Of course, before tackling this real-life scenario, you will have to onboard (a synonym of ingest) the data, clean it, and prepare it for the aggregations, as in the sketch that follows.
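
As a preview, ingesting and preparing a CSV file might look like the following sketch. The file path and column names are invented placeholders, not the chapter's actual dataset:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IngestionSketchApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("NYC schools ingestion sketch")
        .master("local[*]")
        .getOrCreate();

    // Ingest a CSV file with a header row, letting Spark infer the schema.
    // The path below is a hypothetical placeholder.
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", true)
        .option("inferSchema", true)
        .load("data/nyc_school_attendance.csv");

    // Basic preparation: drop rows missing key fields and rename an
    // awkward column (both column names are hypothetical)
    Dataset<Row> cleanDf = df
        .na().drop(new String[] { "schoolId", "date" })
        .withColumnRenamed("pct_attendance", "attendancePct");

    cleanDf.printSchema();
    cleanDf.show(5);
  }
}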

Finally, when the standard aggregations do not suffice, you will need to write your own. This is what you will do in section 15.3: building a user-defined aggregate function (UDAF), a custom function that performs your unique aggregation.
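
As a rough preview of the idea (not the chapter's exact code): in Spark 3.x, a common way to build a UDAF is to extend Aggregator and register it with functions.udaf(); earlier Spark versions used the now-deprecated UserDefinedAggregateFunction. The threshold logic and all names below are hypothetical:

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Counts how many input values exceed a threshold: a simple aggregation
// that Spark's built-in functions do not offer directly.
// Type parameters: input (Integer), buffer (Long), output (Long).
public class CountAboveThreshold extends Aggregator<Integer, Long, Long> {
  private static final long serialVersionUID = 1L;
  private final int threshold;

  public CountAboveThreshold(int threshold) {
    this.threshold = threshold;
  }

  @Override
  public Long zero() {                              // initial, empty buffer
    return 0L;
  }

  @Override
  public Long reduce(Long buffer, Integer value) {  // fold one row into the buffer
    return (value != null && value > threshold) ? buffer + 1 : buffer;
  }

  @Override
  public Long merge(Long b1, Long b2) {             // combine buffers across partitions
    return b1 + b2;
  }

  @Override
  public Long finish(Long reduction) {              // produce the final result
    return reduction;
  }

  @Override
  public Encoder<Long> bufferEncoder() {
    return Encoders.LONG();
  }

  @Override
  public Encoder<Long> outputEncoder() {
    return Encoders.LONG();
  }
}

Registering and calling it would then look something like this fragment, assuming an existing SparkSession spark and a dataframe df with a hypothetical integer column dailyAbsences:

// Register the Aggregator as a named UDAF, then call it in a grouped query
spark.udf().register("countAbove10",
    org.apache.spark.sql.functions.udaf(
        new CountAboveThreshold(10), Encoders.INT()));
df.groupBy("school")
    .agg(org.apache.spark.sql.functions.callUDF(
        "countAbove10", df.col("dailyAbsences")))
    .show();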

Lab

Examples from this chapter are available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch15.

15.1 Aggregating data with Spark

15.1.1 A quick reminder on aggregations

15.1.2 Performing basic aggregations with Spark

15.2 Performing aggregations with live data

15.2.1 Preparing your dataset

15.2.2 Aggregating data to better understand the schools

15.3 Building custom aggregations with UDAFs

Summary