Chapter 2. Data model for Big Data

This chapter covers

Properties of data
The fact-based data model
Benefits of a fact-based model for Big Data
Graph schemas

In the last chapter you saw what can go wrong when using traditional tools for building data systems, and we went back to first principles to derive a better design. You saw that every data system can be formulated as computing functions on data, and you learned the basics of the Lambda Architecture, which provides a practical way to implement an arbitrary function on arbitrary data in real time.

At the core of the Lambda Architecture is the master dataset, which is highlighted in figure 2.1. The master dataset is the source of truth in the Lambda Architecture. Even if you were to lose all your serving layer datasets and speed layer datasets, you could reconstruct your application from the master dataset. This is because the batch views served by the serving layer are produced via functions on the master dataset, and because the speed layer is based only on recent data, it can rebuild itself within a few hours.

Figure 2.1. The master dataset in the Lambda Architecture serves as the source of truth for your Big Data system. Errors at the serving and speed layers can be corrected, but corruption of the master dataset is irreparable.
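To make the idea of reconstruction concrete, here is a minimal sketch of a batch view as a pure function on the master dataset. The record layout, the `MASTER_DATASET` name, and the `build_pageview_batch_view` function are all hypothetical and chosen only for illustration; a real system would keep the master dataset in distributed storage and compute views with a batch processing framework.

```python
from collections import Counter

# Hypothetical master dataset: an immutable, append-only collection of raw records.
# A tuple stands in for durable storage in this sketch.
MASTER_DATASET = (
    {"url": "/home", "user": "alice"},
    {"url": "/home", "user": "bob"},
    {"url": "/about", "user": "alice"},
)


def build_pageview_batch_view(master_dataset):
    """Batch view = function(all data): count pageviews per URL."""
    return dict(Counter(record["url"] for record in master_dataset))


# The serving layer holds derived data only.
serving_layer = {"pageviews": build_pageview_batch_view(MASTER_DATASET)}

# Simulate losing every serving layer dataset.
serving_layer.clear()

# Because the view is a function of the master dataset, it can be recomputed.
serving_layer["pageviews"] = build_pageview_batch_view(MASTER_DATASET)
```

Nothing in the serving layer is stored independently of the master dataset, so losing a view costs only recomputation time. The reverse does not hold: the raw records cannot be recovered from the aggregated counts, which is why corruption of the master dataset is irreparable.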

2.1. The properties of data

2.2. The fact-based model for representing data

2.3. Graph schemas

2.4. A complete data model for SuperWebAnalytics.com

2.5. Summary