Chapter 6. Batch layer


This chapter covers

  • Computing functions on the batch layer
  • Splitting a query into precomputed and on-the-fly components
  • Recomputation versus incremental algorithms
  • The meaning of scalability
  • The MapReduce paradigm
  • A higher-level way of thinking about MapReduce

The goal of a data system is to answer arbitrary questions about your data. Any question you could ask of your dataset can be implemented as a function that takes all of your data as input. Ideally, you could run these functions on the fly whenever you query your dataset. Unfortunately, a function that uses your entire dataset as input will take a very long time to run. You need a different strategy if you want your queries answered quickly.
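The idea that any query is a function of the entire dataset can be sketched in a few lines. This is a hypothetical illustration (the records, fields, and function names are made up for the example): the query scans every raw record each time it runs, which is exactly what becomes too slow at scale.

```python
# Hypothetical master dataset: raw pageview records.
pageviews = [
    {"url": "/blog", "user": "alice", "timestamp": 1},
    {"url": "/blog", "user": "bob",   "timestamp": 2},
    {"url": "/home", "user": "alice", "timestamp": 3},
]

def pageviews_for_url(master_dataset, url):
    """Query = function(all data): scans every record on every call."""
    return sum(1 for view in master_dataset if view["url"] == url)

print(pageviews_for_url(pageviews, "/blog"))
```

On three records this is instant; on petabytes, running the function from scratch per query is infeasible, which motivates precomputation.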

In the Lambda Architecture, the batch layer precomputes the master dataset into batch views so that queries can be resolved with low latency. This requires striking a balance between what will be precomputed and what will be computed at execution time to complete the query. By doing a little bit of computation on the fly to complete queries, you save yourself from needing to precompute absurdly large batch views. The key is to precompute just enough information so that the query can be completed quickly.
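The split between precomputation and on-the-fly work can be sketched as follows. This is a simplified, hypothetical example (the bucket granularity and function names are assumptions, not code from the book): the batch layer folds the raw pageviews into per-hour counts, and a range query is completed at read time by summing only the buckets it needs rather than rescanning raw records.

```python
from collections import Counter

# Hypothetical master dataset: raw pageview records with an hour field.
pageviews = [
    {"url": "/blog", "hour": 0},
    {"url": "/blog", "hour": 0},
    {"url": "/blog", "hour": 1},
    {"url": "/blog", "hour": 3},
]

def build_batch_view(master_dataset):
    """Batch layer: run over ALL data, emit a small precomputed view."""
    view = Counter()
    for v in master_dataset:
        view[(v["url"], v["hour"])] += 1
    return view

def pageviews_in_range(batch_view, url, start_hour, end_hour):
    """Query time: a little on-the-fly work over a few buckets."""
    return sum(batch_view[(url, h)] for h in range(start_hour, end_hour + 1))

view = build_batch_view(pageviews)
print(pageviews_in_range(view, "/blog", 0, 1))
```

Precomputing per-hour buckets rather than every possible range keeps the batch view small, while the on-the-fly summation stays cheap because a range of hours touches only a handful of bucket values.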

6.1. Motivating examples

6.2. Computing on the batch layer

6.3. Recomputation algorithms vs. incremental algorithms

6.4. Scalability in the batch layer

6.5. MapReduce: a paradigm for Big Data computing

6.6. Low-level nature of MapReduce

6.7. Pipe diagrams: a higher-level way of thinking about batch computation

6.8. Summary