Chapter 11. Hive and the Hadoop herd

This chapter covers

What Hive is
Setting up Hive
Using Hive for data warehousing
Other software packages related to Hadoop

As powerful as Hadoop is, it doesn’t offer everything for everybody. Many projects have sprung up to extend Hadoop for specific purposes. The most prominent and well-supported ones have officially become subprojects under the umbrella of the Apache Hadoop project.[1] These subprojects include

Pig: A high-level data flow language
Hive: A SQL-like data warehouse infrastructure
HBase: A distributed, column-oriented database modeled after Google’s Bigtable
ZooKeeper: A reliable coordination system for managing shared state between distributed applications
Chukwa: A data collection system for managing large distributed systems

[1] What we’ve referred to in this book as “Hadoop” so far (HDFS and MapReduce) is technically called the “Hadoop Core” subproject of Apache Hadoop, although colloquially people tend to call it Hadoop.

We covered Pig in detail in chapter 10, and we’ll learn about Hive in this chapter. Furthermore, section 11.2 will briefly describe other Hadoop-related projects. Some of these aren’t associated with Apache (e.g., Cascading, CloudBase). Some are in their nascent stages (e.g., Hama, Mahout). You’ll see some of these tools in action in the case studies of chapter 12.

11.1. Hive

11.2. Other Hadoop-related stuff

11.3. Summary