Chapter 5. Data storage on the batch layer: Illustration


This chapter covers

  • Using the Hadoop Distributed File System (HDFS)
  • Pail, a higher-level abstraction for manipulating datasets

In the last chapter you saw the requirements for storing a master dataset and why a distributed filesystem is a great fit for them. But you also saw that using a filesystem API directly is too low-level for the kinds of operations you need to perform on the master dataset. In this chapter we’ll show you how to use a specific distributed filesystem, HDFS, and then demonstrate how to automate the tasks you need to perform with a higher-level API.

As in all illustration chapters, we’ll focus on specific tools to show the nitty-gritty of applying the higher-level concepts of the previous chapter. As always, our goal is not to compare and contrast all the possible tools but to reinforce those higher-level concepts.

5.1. Using the Hadoop Distributed File System

You’ve already learned the basics of how HDFS works. Let’s quickly review them, with a brief code sketch to follow:

  • Files are split into blocks that are spread among many nodes in the cluster.
  • Blocks are replicated among many nodes so the data is still available even when machines go down.
  • The namenode keeps track of the blocks for each file and where those blocks are stored.
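
To make this concrete, here’s a minimal sketch of the same basics through Hadoop’s Java FileSystem API. The paths are invented for illustration, and the empty Configuration assumes your cluster settings (core-site.xml and hdfs-site.xml) are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        // Reads the Hadoop config from the classpath; fs.defaultFS
        // determines which filesystem (e.g., an HDFS cluster) this talks to
        FileSystem fs = FileSystem.get(new Configuration());

        // Write a small file; HDFS transparently splits it into blocks
        // and replicates each block across the cluster
        FSDataOutputStream out = fs.create(new Path("/tmp/mydataset/part-000"));
        out.writeBytes("record 1\nrecord 2\n");
        out.close();

        // A dataset is just a directory of part files; list them
        for (FileStatus status : fs.listStatus(new Path("/tmp/mydataset"))) {
            System.out.println(status.getPath() + " " + status.getLen());
        }
    }
}

Even this short listing hints at the problem raised in the last chapter: appending new data to the master dataset, enforcing its integrity, and consolidating many small files into larger ones would all have to be hand-rolled on top of calls like these.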

5.2. Data storage in the batch layer with Pail
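
Pail, part of the open source dfs-datastores library, is a thin abstraction over a filesystem that treats a directory of files as a single dataset of records. Here’s a minimal sketch of creating a pail and appending records to it; the path is invented, and because no structure is attached to this pail, each record is just a byte array:

import com.backtype.hadoop.pail.Pail;

public class PailBasics {
    public static void main(String[] args) throws Exception {
        // Creates a new pail at this path (local disk or HDFS,
        // depending on the Hadoop configuration)
        Pail pail = Pail.create("/tmp/mypail");

        // Append a few records; Pail manages the underlying files for you
        Pail.TypedRecordOutputStream out = pail.openWrite();
        out.writeObject(new byte[] {1, 2, 3});
        out.writeObject(new byte[] {1, 2, 3, 4});
        out.close();

        // Tasks that are painful through a raw filesystem API become
        // one-liners: consolidate() merges small files into larger ones
        pail.consolidate();
    }
}

Appending one pail into another is likewise a single call (absorb), and Pail refuses to combine pails whose record types are incompatible, which protects the integrity of the master dataset.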

5.3. Storing the master dataset for SuperWebAnalytics.com
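
The master dataset for SuperWebAnalytics.com stores Thrift Data objects rather than raw byte arrays, so the pail must be told how to serialize, deserialize, and partition them. This is done by implementing the PailStructure interface. What follows is a rough sketch under stated assumptions: Data is the Thrift-generated class from the schema introduced earlier in the book, and the pass-through partitioning shown here is illustrative rather than a finished implementation:

import java.util.Collections;
import java.util.List;

import com.backtype.hadoop.pail.PailStructure;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;

// Data is assumed to be the Thrift-generated class from the
// SuperWebAnalytics.com schema
public class DataPailStructure implements PailStructure<Data> {
    public Class getType() {
        return Data.class;
    }

    // Thrift handles the byte-level representation of each record
    public byte[] serialize(Data data) {
        try {
            return new TSerializer().serialize(data);
        } catch (TException e) {
            throw new RuntimeException(e);
        }
    }

    public Data deserialize(byte[] bytes) {
        Data data = new Data();
        try {
            new TDeserializer().deserialize(data, bytes);
        } catch (TException e) {
            throw new RuntimeException(e);
        }
        return data;
    }

    // An empty target means no vertical partitioning; a fuller
    // implementation could return subdirectory names here to
    // partition the dataset by, say, property type
    public List<String> getTarget(Data data) {
        return Collections.emptyList();
    }

    public boolean isValidTarget(String... dirs) {
        return true;
    }
}

A pail created with this structure (for example, Pail.create(path, new DataPailStructure())) then reads and writes Data objects directly, and a richer getTarget implementation buys you vertical partitioning, so queries only touch the parts of the dataset they need.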

5.4. Summary