11. Working with SQL


This chapter covers

  • Using SQL within Spark
  • Determining the local or global scope of your views
  • Mixing the dataframe API and SQL
  • Deleting records in a dataframe

Structured Query Language (SQL) is the gold standard for manipulating data. Introduced in 1974, it has since evolved into an ISO standard (ISO/IEC 9075). The latest revision is SQL:2016.

It seems that SQL has been around forever as a way to extract and manipulate data in relational databases. And SQL will be around forever. When I was in college, I clearly remember asking my database professor, “Who do you expect will use SQL? A secretary making a report?” His answer was simply, “Yes.” (Based on that answer, I might just figure that you are a secretary who wants to use Spark.)

I realized that SQL was becoming a powerful tool when, a few months later, I used it with Oracle Pro*C. Pro*C is Oracle's embedded SQL precompiler, which lets you mix SQL statements directly into your C applications. Fast-forward to more recent technologies such as Java and JDBC, and SQL's massive presence is still apparent: SQL is still filling your JDBC ResultSet.

Based on SQL’s popularity and widespread usage, embedding SQL in Spark makes complete sense, especially because you are manipulating structured and semistructured data. In this chapter, you will use SQL with Spark and Java. I will not teach SQL; even if your SQL is a little rusty, you will still be able to follow the examples.

11.1 Working with Spark SQL
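
To set the stage, here is a minimal sketch of running SQL against a dataframe from Java. The file name (data/census.csv) and its columns (state, population) are hypothetical, used only for illustration; createOrReplaceTempView() and sql() are the standard Spark SQL entry points.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SimpleSqlApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Simple SQL over a dataframe")
        .master("local[*]")
        .getOrCreate();

    // Load a CSV file into a dataframe (file and columns are hypothetical)
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", true)
        .option("inferSchema", true)
        .load("data/census.csv");

    // Register the dataframe as a temporary view, so SQL can reference it
    df.createOrReplaceTempView("census");

    // Plain SQL over the view; the result is just another dataframe
    Dataset<Row> result = spark.sql(
        "SELECT state, SUM(population) AS total "
        + "FROM census GROUP BY state ORDER BY total DESC");
    result.show();

    spark.stop();
  }
}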

11.2 The difference between local and global views
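
The sketch below, which continues with the spark session and the df dataframe from the previous example, contrasts the two scopes. A temporary view is local to its SparkSession; a global temporary view is shared by all sessions of the same application and lives in the reserved global_temp database. The view names are, again, only illustrative.

// A local view: visible only within this SparkSession
df.createOrReplaceTempView("census");
spark.sql("SELECT COUNT(*) FROM census").show();

// A global view: shared across sessions of the same application,
// always addressed through the global_temp database
df.createOrReplaceGlobalTempView("census_global");
spark.sql("SELECT COUNT(*) FROM global_temp.census_global").show();

// A fresh session sees the global view, but not the local one
SparkSession spark2 = spark.newSession();
spark2.sql("SELECT COUNT(*) FROM global_temp.census_global").show();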

11.3 Mixing the dataframe API and Spark SQL
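
Because sql() returns a plain dataframe, you can freely chain SQL and the dataframe API. A small sketch, still assuming the hypothetical census view from the earlier examples:

// Let SQL do the aggregation...
Dataset<Row> totals = spark.sql(
    "SELECT state, SUM(population) AS total FROM census GROUP BY state");

// ...then keep refining the result with the dataframe API
Dataset<Row> largest = totals
    .filter(totals.col("total").gt(1_000_000))
    .orderBy(totals.col("total").desc());
largest.show();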

11.4 Don’t DELETE it!
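
Dataframes are immutable, and a temporary view built on top of one does not support a DELETE statement. The idiom, sketched below with the same hypothetical census data, is to select the records you want to keep into a new dataframe rather than deleting the ones you don't:

// "Deleting" rows really means keeping everything else
Dataset<Row> kept = spark.sql(
    "SELECT * FROM census WHERE population IS NOT NULL");

// The equivalent, using the dataframe API directly
Dataset<Row> keptApi = df.filter(df.col("population").isNotNull());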

11.5 Going further with SQL

Summary