Chapter 5. Sparkling queries with Spark SQL


This chapter covers

  • Creating DataFrames
  • Using the DataFrame API
  • Using SQL queries
  • Loading and saving data from/to external data sources
  • Understanding the Catalyst optimizer
  • Understanding Tungsten performance improvements
  • Introducing DataSets

You had a taste of working with DataFrames in chapter 3. As you saw there, DataFrames let you work with structured data: data organized in rows and columns, where each column contains only values of a certain type. SQL, the language most commonly used with relational databases, is also the most common way to organize and query this kind of data, and it lends its name to the first Spark component we're covering in part 2: Spark SQL.
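To make the connection between DataFrames and SQL concrete, here is a minimal sketch (assuming Spark 2.x running in local mode; the session setup, view name, data, and column names are illustrative, not the chapter's dataset):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative setup: a local SparkSession. In the spark-shell,
// a session named `spark` already exists.
val spark = SparkSession.builder()
  .appName("sql-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// A tiny DataFrame of (id, tag) rows with hypothetical columns.
val posts = Seq((1, "scala"), (2, "sql"), (3, "sql")).toDF("id", "tag")

// Register it as a temporary view and query it with ordinary SQL.
posts.createOrReplaceTempView("posts")
spark.sql("SELECT tag, count(*) AS cnt FROM posts GROUP BY tag").show()
```

The point to notice is that the same tabular data is reachable both through the DataFrame API and through a plain SQL string, which is exactly the dual interface this chapter explores.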

In this chapter, we examine the DataFrame API more closely. In section 5.1, you'll first learn how to convert RDDs to DataFrames. You'll then use the DataFrame API on a sample dataset from the Stack Exchange website to select, filter, sort, group, and join data. We'll show you everything you need to know about using SQL functions with DataFrames and how to convert DataFrames back to RDDs.
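The round trip described above — RDD to DataFrame, a few DataFrame API operations, and back to an RDD — can be sketched as follows (again assuming Spark 2.x in local mode; the data and column names are illustrative assumptions, not the chapter's Stack Exchange dataset):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; in the spark-shell `spark` is predefined.
val spark = SparkSession.builder()
  .appName("df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Convert an RDD of tuples to a DataFrame by naming its columns.
val rdd = spark.sparkContext.parallelize(
  Seq((1, "alice", 10), (2, "bob", 3), (3, "carol", 7)))
val df = rdd.toDF("id", "user", "score")

// Select, filter, and sort with the DataFrame API.
df.select("user", "score")
  .where($"score" > 5)
  .orderBy($"score".desc)
  .show()

// Convert back to an RDD of Row objects when the RDD API is needed.
val rows = df.rdd
```

`toDF` (made available by `import spark.implicits._`) is only one of the conversion routes; section 5.1 also covers supplying an explicit schema, which you need when column types can't be inferred from the data.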

5.1. Working with DataFrames

5.2. Beyond DataFrames: introducing DataSets

5.3. Using SQL commands

5.4. Saving and loading DataFrame data

5.5. Catalyst optimizer

5.6. Performance improvements with Tungsten

5.7. Summary