Chapter 3. Data model for Big Data: Illustration

This chapter covers

Apache Thrift
Implementing a graph schema using Apache Thrift
Limitations of serialization frameworks

In the last chapter you saw the principles of forming a data model: the value of raw data, the role of semantic normalization, and the critical importance of immutability. You saw how a graph schema can satisfy all these properties, and what the graph schema looks like for SuperWebAnalytics.com.

This is the first of the illustration chapters, in which we demonstrate the concepts of the previous chapter using real-world tools. You can read just the theory chapters of the book and learn the whole Lambda Architecture, but the illustration chapters show you the nuances of mapping the theory to real code. In this chapter we’ll implement the SuperWebAnalytics.com data model using Apache Thrift, a serialization framework. You’ll see that even in a task as straightforward as writing a schema, there is friction between the idealized theory and what you can achieve in practice.

3.1. Why a serialization framework?

Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs. It’s our experience that data corruption errors are some of the most time-consuming to debug.
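
To make the problem concrete, here is a minimal sketch in Python of how a schemaless format lets inconsistent records through. The record layout (`user_id`, `url`, `timestamp`), the `PAGEVIEW_SCHEMA` table, and the `validate` helper are all hypothetical names chosen for illustration; they are not part of the SuperWebAnalytics.com schema, and the hand-rolled check only stands in for what a serialization framework would enforce for you.

```python
import json

# Hypothetical records from two writers that disagree on the field layout.
record_a = {"user_id": 123, "url": "http://example.com/page", "timestamp": 1700000000}
# Second writer misspells the key, stores the id as a string, and omits the timestamp.
record_b = {"userid": "123", "url": "http://example.com/page"}

# JSON serializes both without complaint, so the malformed record reaches storage.
stored = [json.dumps(record_a), json.dumps(record_b)]

# Hypothetical schema for illustration: required field name -> expected type.
PAGEVIEW_SCHEMA = {"user_id": int, "url": str, "timestamp": int}


def validate(record, schema):
    """Return a list of problems; an empty list means the record matches the schema."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    for field in record:
        if field not in schema:
            problems.append(f"unknown field: {field}")
    return problems


if __name__ == "__main__":
    for raw in stored:
        print(raw, "->", validate(json.loads(raw), PAGEVIEW_SCHEMA) or "ok")
```

Without a check like this at write time, the second record is discovered only when some later computation trips over it, far from the code that produced it. Catching the mismatch when the data is written keeps the error next to its cause.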
