Chapter 6. Batch layer


This chapter covers

  • Computing functions on the batch layer
  • Splitting a query into precomputed and on-the-fly components
  • Recomputation versus incremental algorithms
  • The meaning of scalability
  • The MapReduce paradigm
  • A higher-level way of thinking about MapReduce

The goal of a data system is to answer arbitrary questions about your data. Any question you could ask of your dataset can be implemented as a function that takes all of your data as input. Ideally, you could run these functions on the fly whenever you query your dataset. Unfortunately, a function that uses your entire dataset as input will take a very long time to run. You need a different strategy if you want your queries answered quickly.
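The idea that any query is a function of the entire dataset can be sketched in a few lines. This is a hypothetical illustration (the records, fields, and function names are made up for the example): the query scans every raw record each time it runs, which is exactly what becomes too slow at scale.

```python
# Hypothetical master dataset: raw pageview records.
pageviews = [
    {"url": "/blog", "user": "alice", "timestamp": 1},
    {"url": "/blog", "user": "bob",   "timestamp": 2},
    {"url": "/home", "user": "alice", "timestamp": 3},
]

def pageviews_for_url(master_dataset, url):
    """Query = function(all data): scans every record on every call."""
    return sum(1 for view in master_dataset if view["url"] == url)

print(pageviews_for_url(pageviews, "/blog"))
```

On three records this is instant; on petabytes, running the function from scratch per query is infeasible, which motivates precomputation.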

In the Lambda Architecture, the batch layer precomputes the master dataset into batch views so that queries can be resolved with low latency. This requires striking a balance between what will be precomputed and what will be computed at execution time to complete the query. By doing a little bit of computation on the fly to complete queries, you save yourself from needing to precompute absurdly large batch views. The key is to precompute just enough information so that the query can be completed quickly.
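The split between precomputation and on-the-fly work can be sketched as follows. This is a simplified, hypothetical example (the bucket granularity and function names are assumptions, not code from the book): the batch layer folds the raw pageviews into per-hour counts, and a range query is completed at read time by summing only the buckets it needs rather than rescanning raw records.

```python
from collections import Counter

# Hypothetical master dataset: raw pageview records with an hour field.
pageviews = [
    {"url": "/blog", "hour": 0},
    {"url": "/blog", "hour": 0},
    {"url": "/blog", "hour": 1},
    {"url": "/blog", "hour": 3},
]

def build_batch_view(master_dataset):
    """Batch layer: run over ALL data, emit a small precomputed view."""
    view = Counter()
    for v in master_dataset:
        view[(v["url"], v["hour"])] += 1
    return view

def pageviews_in_range(batch_view, url, start_hour, end_hour):
    """Query time: a little on-the-fly work over a few buckets."""
    return sum(batch_view[(url, h)] for h in range(start_hour, end_hour + 1))

view = build_batch_view(pageviews)
print(pageviews_in_range(view, "/blog", 0, 1))
```

Precomputing per-hour buckets rather than every possible range keeps the batch view small, while the on-the-fly summation stays cheap because a range of hours touches only a handful of bucket values.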

6.1. Motivating examples

6.2. Computing on the batch layer

6.3. Recomputation algorithms vs. incremental algorithms

6.4. Scalability in the batch layer

6.5. MapReduce: a paradigm for Big Data computing

6.6. Low-level nature of MapReduce

6.7. Pipe diagrams: a higher-level way of thinking about batch computation

6.8. Summary