15 Aggregating your data

This chapter covers

  • Refreshing your knowledge of aggregations
  • Performing basic aggregations
  • Using live data to perform aggregations
  • Building custom aggregations

Aggregating data is a way to group data so you can view it at a macro level rather than at an atomic, or micro, level. Aggregations are an essential step toward better analytics and, down the road, toward machine learning and artificial intelligence.

In this chapter, you will start slowly, with a brief reminder of what aggregations are, and then perform basic aggregations with Spark. You will use both Spark SQL and the dataframe API.

Once you have gone through the basics, you will analyze open data from the New York City public schools. You will study attendance, absenteeism, and more through aggregations. Of course, before tackling this real-life scenario, you will have to onboard (a synonym for ingest) the data, clean it, and prepare it for aggregation.

Finally, when the standard aggregations do not suffice, you will need to write your own. This is what you will do in section 15.3: build a user-defined aggregation function (UDAF), a custom function that performs the aggregation you define.

Lab

Examples from this chapter are available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch15.