Chapter 9. SQL on Hadoop

This chapter covers

  • Learning the Hadoop specifics of Hive, including user-defined functions and performance-tuning tips
  • Learning about Impala and how you can write user-defined functions
  • Embedding SQL in your Spark code to intertwine the two languages and play to their strengths

Let’s say that it’s nine o’clock in the morning and you’ve been asked to generate a report on the top 10 countries by visitor traffic over the last month, and it needs to be done by noon. Your log data is sitting in HDFS, ready to be used. Are you going to break out your IDE and start writing Java MapReduce code? Not likely. This is where high-level languages such as Hive, Impala, and Spark come into play. With their SQL syntax, Hive and Impala let you write and start executing queries in the time it would take you to write your main method in Java.
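
To make the example concrete, here’s a minimal HiveQL sketch of such a report. The weblogs table and its country and log_date columns are hypothetical stand-ins for your own schema, and the date range is illustrative:

    -- Minimal sketch: top 10 countries by visits over the last month.
    -- weblogs, country, and log_date are hypothetical; substitute your schema.
    SELECT country, COUNT(*) AS visits
    FROM weblogs
    WHERE log_date >= '2014-06-01' AND log_date < '2014-07-01'
    GROUP BY country
    ORDER BY visits DESC
    LIMIT 10;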

The big advantage of Hive is that it no longer requires MapReduce to execute queries. As of Hive 0.13, Hive can run on Tez, a general DAG execution framework that doesn’t impose MapReduce’s barrier of writing intermediate results to HDFS and disk between successive steps. Impala and Spark were likewise built from the ground up without MapReduce behind the scenes.
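
If Tez is installed on your cluster, pointing Hive at it is typically a one-line session setting. The property below, hive.execution.engine, is a standard Hive configuration knob; the query reuses the hypothetical weblogs table from the sketch above:

    -- Switch the current Hive session from MapReduce (mr) to Tez.
    SET hive.execution.engine=tez;

    -- Subsequent queries in this session run as Tez DAGs instead of
    -- chains of MapReduce jobs.
    SELECT country, COUNT(*) AS visits FROM weblogs GROUP BY country;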

9.1. Hive

9.2. Impala

9.3. Spark SQL

9.4. Chapter summary