15 Aggregating your data

This chapter covers

  • Refreshing your knowledge of aggregations
  • Performing basic aggregations
  • Using live data to perform aggregations
  • Building custom aggregations

Aggregating data is a way to group data so you can view it at a macro level rather than at an atomic, or micro, level. Aggregations are an essential step toward better analytics and, down the road, toward machine learning and artificial intelligence.

In this chapter, you will start slowly, with a brief reminder of what aggregations are, and then perform basic aggregations with Spark. You will use both Spark SQL and the dataframe API.

Once you have gone through the basics, you will analyze open data from the New York City public schools. You will study attendance, absenteeism, and more through aggregations. Of course, before tackling this real-life scenario, you will have to onboard (a synonym for ingest) the data, clean it, and prepare it for aggregation.

Finally, when the standard aggregations do not suffice, you will need to write your own. This is what you will do in section 15.3: build a user-defined aggregation function (UDAF), a custom function that performs the aggregation you define.

Lab

Examples from this chapter are available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch15.