This chapter covers
Aggregating is a way to group data so you can view it at a macro level rather than at an atomic, or micro, level. Aggregations are an essential step toward better analytics and, down the road, machine learning and artificial intelligence.
In this chapter, you will start slowly, with a short reminder of what aggregations are. Then you'll perform basic aggregations with Spark, using both Spark SQL and the dataframe API, as sketched in the example below.
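To give you a flavor of what is coming, here is a minimal sketch of the same simple aggregation expressed first with the dataframe API and then with Spark SQL. The file name, column names, and the averaging logic are illustrative assumptions, not the chapter's lab code.

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BasicAggregationSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Basic aggregation sketch")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical CSV with one attendance record per school and day
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", true)
        .option("inferSchema", true)
        .load("data/attendance.csv");

    // Dataframe API: average attendance per school
    Dataset<Row> byApi = df
        .groupBy(col("schoolId"))
        .agg(avg(col("attendance")).alias("avgAttendance"));
    byApi.show(5);

    // Spark SQL: the same aggregation expressed as a query
    df.createOrReplaceTempView("attendance");
    Dataset<Row> bySql = spark.sql(
        "SELECT schoolId, AVG(attendance) AS avgAttendance "
        + "FROM attendance GROUP BY schoolId");
    bySql.show(5);

    spark.stop();
  }
}

Both paths produce the same result; choosing between them is mostly a matter of taste and of where the rest of your pipeline lives.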
Once you have gone through the basics, you will analyze open data from New York City public schools, studying attendance, absenteeism, and more through aggregations. Of course, prior to this real-life scenario, you will have to onboard (a synonym for ingest) the data, clean it, and prepare it for the aggregations.
Finally, when the standard aggregations do not suffice, you will need to write your own. This is what you will do in section 15.3: you will build a user-defined aggregate function (UDAF), a custom function that performs an aggregation specific to your needs.
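As a preview of what a custom aggregation can look like, here is a minimal sketch using Spark 3's Aggregator class, one way to write a UDAF in Java. The class name, the capped-sum logic, and the registration call are illustrative assumptions, not the function you will build in section 15.3.

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.expressions.Aggregator;

// Hypothetical custom aggregation: sums integer values,
// capping each individual input at 50 before adding it.
public class CappedSumAggregator extends Aggregator<Integer, Integer, Integer> {
  private static final int CAP = 50;

  @Override
  public Integer zero() {                                  // initial buffer value
    return 0;
  }

  @Override
  public Integer reduce(Integer buffer, Integer input) {   // fold one row into the buffer
    return buffer + Math.min(input, CAP);
  }

  @Override
  public Integer merge(Integer b1, Integer b2) {           // combine partial results
    return b1 + b2;
  }

  @Override
  public Integer finish(Integer reduction) {               // produce the final value
    return reduction;
  }

  @Override
  public Encoder<Integer> bufferEncoder() {
    return Encoders.INT();
  }

  @Override
  public Encoder<Integer> outputEncoder() {
    return Encoders.INT();
  }
}

Once registered with spark.udf().register("cappedSum", functions.udaf(new CappedSumAggregator(), Encoders.INT())), such a function can be called like any built-in aggregation, from both the dataframe API and Spark SQL.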
Lab
Examples from this chapter are available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch15.