Chapter 4. Data storage on the batch layer


This chapter covers

  • Storage requirements for the master dataset
  • Distributed filesystems
  • Improving efficiency with vertical partitioning

In the last two chapters you learned about a data model for the master dataset and how to translate that data model into a graph schema. You saw the importance of making data immutable and eternally true. The next step is to learn how to physically store that data in the batch layer. Figure 4.1 recaps where we are in the Lambda Architecture.

Figure 4.1. The batch layer must structure large, continually growing datasets in a manner that supports low maintenance as well as efficient creation of the batch views.

Like the last two chapters, this chapter is dedicated to the master dataset. The master dataset is typically too large to exist on a single server, so you must choose how you’ll distribute your data across multiple machines. The way you store your master dataset will impact how you consume it, so it’s vital to devise your storage strategy with your usage patterns in mind.
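To make the idea of distributed, append-only storage concrete before we dig into the details, here is a minimal sketch of what a master dataset could look like on a Hadoop distributed filesystem: a single folder holding the entire dataset, where new data is only ever added as fresh files and existing files are never modified. This is not the SuperWebAnalytics.com code from later in the chapter; the /data/master path and file-naming scheme are illustrative assumptions, and only the standard Hadoop FileSystem API is used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MasterDatasetLayout {
    public static void main(String[] args) throws Exception {
        // Connect to the distributed filesystem configured for this cluster.
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical location of the master dataset: one folder, many files.
        Path master = new Path("/data/master");

        // "Appending" to the master dataset means writing a brand-new file.
        // Existing files are immutable and are never rewritten.
        Path newFile = new Path(master, "batch-" + System.currentTimeMillis() + ".data");
        try (FSDataOutputStream out = fs.create(newFile)) {
            out.writeUTF("serialized data units would go here");
        }

        // Consuming the master dataset means reading every file in the folder.
        for (FileStatus status : fs.listStatus(master)) {
            System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
        }
    }
}

The key property this sketch illustrates is that the storage layer never needs random writes or updates in place; it only needs to append files and scan folders, which is exactly the access pattern the rest of this chapter builds on.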

In this chapter you’ll do the following:

  • Determine the requirements for storing the master dataset
  • See why distributed filesystems are a natural fit for storing a master dataset
  • See how the batch layer storage for the SuperWebAnalytics.com project maps to distributed filesystems

We’ll begin by examining how the role of the batch layer within the Lambda Architecture affects how you should store your data.

4.1. Storage requirements for the master dataset

4.2. Choosing a storage solution for the batch layer

4.3. How distributed filesystems work

4.4. Storing a master dataset with a distributed filesystem

4.5. Vertical partitioning

4.6. Low-level nature of distributed filesystems

4.7. Storing the SuperWebAnalytics.com master dataset on a distributed filesystem

4.8. Summary
