Chapter 20. Cloud Dataflow: large-scale data processing

 

This chapter covers

  • What do we mean by data processing?
  • What is Apache Beam?
  • What is Cloud Dataflow?
  • How can you use Apache Beam and Cloud Dataflow together to process large sets of data?

You’ve probably heard the term data processing before, likely meaning something like “taking some data and transforming it somehow.” More specifically, data processing usually means taking a lot of data (gigabytes at the very least), potentially combining it with other data, and ending up with either an enriched data set of similar size or a smaller data set that summarizes some aspect of the original. For example, imagine you had your entire email history in one big pile and all of your contact information (email addresses and birthdays) in another. With data processing, you could join those two piles on the email address and then filter the result down to only the emails that were sent on someone’s birthday (figure 20.1).

Figure 20.1. Using data processing to combine sets of data for further filtering
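
To make the join-then-filter idea concrete, here is a minimal sketch in plain Python. It is not Apache Beam or Cloud Dataflow code; it runs in memory on a tiny, made-up data set, and the record layout (the `to`, `sent`, `subject`, `email`, and `birthday` fields) and the `birthday_emails` helper are assumptions chosen purely for illustration.

```python
from datetime import date

# Hypothetical sample data; the field names are assumptions for illustration.
emails = [
    {"to": "alice@example.com", "sent": date(2017, 3, 14), "subject": "Happy birthday!"},
    {"to": "alice@example.com", "sent": date(2017, 5, 2), "subject": "Lunch?"},
    {"to": "bob@example.com", "sent": date(2017, 7, 9), "subject": "Have a great day"},
    {"to": "carol@example.com", "sent": date(2017, 1, 1), "subject": "Happy new year"},
]
contacts = [
    {"email": "alice@example.com", "birthday": date(1985, 3, 14)},
    {"email": "bob@example.com", "birthday": date(1990, 7, 9)},
]


def birthday_emails(emails, contacts):
    """Join emails to contacts on the email address, then keep birthday emails."""
    # Index the contacts pile by email address so the join is a dictionary lookup.
    birthdays = {contact["email"]: contact["birthday"] for contact in contacts}

    # Join step: pair each email with the recipient's birthday.
    # Emails to addresses with no matching contact are dropped.
    joined = [
        (email, birthdays[email["to"]])
        for email in emails
        if email["to"] in birthdays
    ]

    # Filter step: keep emails whose send date matches the birthday's month and day.
    return [
        email
        for email, birthday in joined
        if (email["sent"].month, email["sent"].day) == (birthday.month, birthday.day)
    ]


result = birthday_emails(emails, contacts)
for email in result:
    print(email["to"], email["sent"], email["subject"])
```

On a real email history, neither pile would fit comfortably in a single dictionary or list on one machine, which is the gap that a data processing system is meant to fill.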

20.1. What is Apache Beam?

20.2. What is Cloud Dataflow?

20.3. Interacting with Cloud Dataflow

20.4. Understanding pricing

Summary