Chapter 7. Batch layer: Illustration

 

This chapter covers

  • Sources of complexity in data-processing code
  • JCascalog as a practical implementation of pipe diagrams
  • Applying abstraction and composition techniques to data processing

In the last chapter you saw how pipe diagrams are a natural and concise way to specify computations that operate over large amounts of data. You also saw that pipe diagrams can be executed as a series of MapReduce jobs for parallelism and scalability.

In this illustration chapter, we’ll look at a tool that’s a fairly direct mapping of pipe diagrams: JCascalog. There’s a lot to cover in JCascalog, so this chapter is considerably more involved than the previous illustration chapters. As always, you can still learn the full theory of the Lambda Architecture without reading the illustration chapters. But with JCascalog in particular, we aim to open your mind to what is possible with data-processing tools. A key point is that your data-processing code is no different from any other code you write. As such, it requires good abstractions that are reusable and composable. Abstraction and composition are the cornerstones of good software engineering.

7.1. An illustrative example

7.2. Common pitfalls of data-processing tools

7.3. An introduction to JCascalog

7.4. Composition

7.5. Summary