Chapter 7. Big Data stack


This chapter covers

  • Adding reliability to the data store from chapter 6
  • Managing a distributed persistent data store in CoreOS
  • Simulating failures in the data system

In this chapter, you’ll build a Big Data aggregation platform that seeds a database with the results of random search queries against Twitter. You’ll build a small corpus of data, get yourself rate-limited by Twitter (while still being a good API citizen), and see how to take care of your mission-critical (although random) data. Your application will function like this:

  1. Six stateless workers will generate a random word and search for it on the Twitter API.
  2. The results will be stored in Couchbase.
  3. Workers will continue to search every 100 ms in parallel until they’re rate limited.
  4. Once they’re rate limited, they’ll set a distributed lock in etcd with a 15-minute TTL.
  5. All workers will fast-exit on the presence of that lock.
  6. When the lock expires, workers will start over at step 1.
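The worker loop above can be sketched in a few lines. This is an illustration only, with hypothetical names (`worker_pass`, `FakeLockStore`, `RateLimited`): the real workers talk to etcd, Couchbase, and the Twitter API, all of which are stubbed out here so the control flow of steps 1–6 stands alone.

```python
RATE_LIMIT_TTL = 15 * 60  # the 15-minute TTL from step 4, in seconds


class RateLimited(Exception):
    """Raised by search() when Twitter returns a rate-limit response."""


class FakeLockStore:
    """In-memory stand-in for etcd's TTL'd keys (illustration only)."""

    def __init__(self):
        self._expiry = {}  # key -> absolute expiry time

    def set(self, key, ttl, now):
        self._expiry[key] = now + ttl

    def exists(self, key, now):
        expiry = self._expiry.get(key)
        return expiry is not None and now < expiry


def worker_pass(search, store, lock, now, max_iters=1000):
    """One worker's run through steps 1-6.

    search() returns results for a random word or raises RateLimited;
    store() persists them (Couchbase in the real stack); lock is the
    shared TTL key store; now() supplies the current time.
    """
    for _ in range(max_iters):
        if lock.exists("/ratelimit", now()):            # step 5: fast-exit
            return "locked"
        try:
            results = search()                          # steps 1-2
        except RateLimited:
            lock.set("/ratelimit", RATE_LIMIT_TTL, now())  # step 4
            return "ratelimited"
        store(results)
        # step 3: the real worker sleeps 100 ms here; omitted in the sketch
    return "exhausted"
```

Because the lock lives in a shared store rather than in each worker's memory, one rate-limited worker is enough to stop all six, and etcd's TTL handles step 6 automatically: when the key expires, `exists()` turns false and the next pass resumes searching.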

This will be an evolution of the application from chapter 6, so you must finish that project first. You’re moving to a distributed data system that will give you more performance, greater capacity, and higher availability. You’re also moving to a data source that will allow you to have multiple simultaneous connections to it, unlike the Meetup.com stream. This lets you play with a swarm of workers and control them with a distributed lock.

7.1. Scope of this chapter’s example

7.2. New stack components

7.3. Breaking your stack

7.4. Summary