Chapter 5. Data storage on the batch layer: Illustration
This chapter covers
- Using the Hadoop Distributed File System (HDFS)
- Pail, a higher-level abstraction for manipulating datasets
In the last chapter you saw the requirements for storing a master dataset and why a distributed filesystem is a great fit for those requirements. But you also saw that using a filesystem API directly is too low-level for the kinds of operations you need to perform on the master dataset. In this chapter we’ll show you how to use a specific distributed filesystem—HDFS—and then demonstrate how to automate the tasks you need to perform with a higher-level API.
Like all illustration chapters, this one focuses on specific tools to show the nitty-gritty of applying the higher-level concepts of the previous chapter. As always, our goal is not to compare and contrast all the possible tools but to reinforce those higher-level concepts.
To review, HDFS works as follows:

- Files are split into blocks that are spread among many nodes in the cluster.
- Blocks are replicated among many nodes so the data is still available even when machines go down.
- The namenode keeps track of the blocks for each file and where those blocks are stored.
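The mechanics above can be sketched in a few lines of code. This is a conceptual illustration only, not the real HDFS implementation: names like `BLOCK_SIZE`, `split_into_blocks`, and `assign_replicas` are invented for this sketch, the block size is shrunk to a few bytes (HDFS uses large blocks, e.g. 128 MB by default in recent versions), and the round-robin placement stands in for HDFS's rack-aware placement policy. The `block_table` dictionary plays the role of the namenode's metadata: which blocks make up a file, and which nodes hold each block.

```python
# Conceptual sketch of HDFS-style block storage (illustrative names,
# not any real HDFS API).

BLOCK_SIZE = 4          # tiny for illustration; real HDFS blocks are huge
REPLICATION = 3         # each block is stored on this many nodes
NODES = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as HDFS splits files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_replicas(num_blocks: int, nodes=NODES, replication: int = REPLICATION):
    """Build a namenode-style table mapping block index -> nodes holding it."""
    table = {}
    for b in range(num_blocks):
        # Round-robin placement; real HDFS uses rack-aware placement.
        table[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return table

data = b"hello, distributed world!"
blocks = split_into_blocks(data)
block_table = assign_replicas(len(blocks))

print(len(blocks))      # the 25-byte "file" becomes 7 blocks
print(block_table[0])   # the 3 nodes holding block 0
```

Because every block lives on multiple nodes, the loss of any single machine leaves at least two copies of each of its blocks available elsewhere; the namenode's table is what lets clients find them.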