Chapter 4. Organizing and optimizing data in HDFS


This chapter covers

  • Tips for laying out and organizing your data
  • Data access patterns to optimize reading and writing your data
  • The importance of compression, and choosing the best codec for your needs

In the previous chapter, we looked at how to work with different file formats in MapReduce and which ones are best suited for storing your data. Once you've settled on the data format you'll be using, it's time to start thinking about how you'll organize your data in HDFS. It's important to give yourself enough time early in the design of your Hadoop system to understand how your data will be accessed, so that you can optimize for the most important use cases you'll be supporting.

Numerous factors will influence your data organization decisions, such as whether you'll need to provide SQL access to your data (you likely will), which fields will be used to look up the data, and what access-time SLAs you'll need to support. At the same time, you need to make sure that you don't put unnecessary heap pressure on the HDFS NameNode with a large number of small files, and you'll also need to learn how to work with very large input datasets.
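To see why small files matter, a commonly cited Hadoop rule of thumb is that each file, directory, and block object consumes roughly 150 bytes of NameNode heap. The following back-of-the-envelope sketch (the `namenode_heap_bytes` helper and the specific file counts are illustrative assumptions, not Hadoop APIs) compares the metadata cost of many small files against the same data packed into larger files:

```python
# Rough estimate of NameNode heap pressure from file and block metadata.
# Assumption: ~150 bytes of heap per file or block object (a widely cited
# Hadoop rule of thumb; the real cost varies by version and configuration).

BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file):
    """Approximate NameNode heap used by num_files files."""
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * BYTES_PER_OBJECT

# 100 million 1 MB files (one block each) vs. the same ~95 TB packed into
# ~97,657 files of 1 GiB each (8 blocks at a 128 MB block size).
small = namenode_heap_bytes(100_000_000, blocks_per_file=1)
packed = namenode_heap_bytes(97_657, blocks_per_file=8)

print(f"small files:  ~{small / 2**30:.1f} GiB of NameNode heap")
print(f"packed files: ~{packed / 2**30:.2f} GiB of NameNode heap")
```

Under these assumptions, the small-file layout costs two orders of magnitude more NameNode heap for the same data, which is why consolidating small files is a recurring theme in this chapter.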

4.1. Data organization

4.2. Efficient storage with compression

4.3. Chapter summary