Part 1. Batch layer

Part 1 focuses on the batch layer of the Lambda Architecture. Chapters alternate between theory and illustration.

Chapter 2 discusses how you model and schematize the data in your master dataset. Chapter 3 illustrates these concepts using Apache Thrift.

Chapter 4 discusses the storage requirements for your master dataset. You’ll see that many features typically provided by database solutions are not needed for the master dataset, and in fact get in the way of optimizing its storage. A simpler, less feature-rich storage solution meets the requirements better. Chapter 5 illustrates practical storage of a master dataset using the Hadoop Distributed File System (HDFS).

Chapter 6 discusses computing arbitrary functions on your master dataset using the MapReduce paradigm. MapReduce is general enough to compute any scalable function. Although MapReduce is powerful, you’ll see that higher-level abstractions make it far easier to use. Chapter 7 presents JCascalog, a powerful high-level abstraction over MapReduce.

To tie these concepts together, chapters 8 and 9 implement the complete batch layer for the running example, SuperWebAnalytics.com. Chapter 8 shows the overall architecture and algorithms, while chapter 9 shows the working code in full detail.