Preface
When I first entered the world of Big Data, it felt like the Wild West of software development. Many were abandoning the relational database and its familiar comforts for NoSQL databases with highly restricted data models designed to scale to thousands of machines. The number of NoSQL databases, many of them with only minor differences between them, became overwhelming. A new project called Hadoop began to make waves, promising the ability to do deep analyses on huge amounts of data. Making sense of how to use these new tools was bewildering.
At the time, I was trying to handle the scaling problems we were faced with at the company at which I worked. The architecture was intimidatingly complex—a web of sharded relational databases, queues, workers, masters, and slaves. Corruption had worked its way into the databases, and special code existed in the application to handle the corruption. Slaves were always behind. I decided to explore alternative Big Data technologies to see if there was a better design for our data architecture.
One experience from my early software-engineering career deeply shaped my view of how systems should be architected. A coworker of mine had spent a few weeks collecting data from the internet onto a shared filesystem. He was waiting to collect enough data so that he could perform an analysis on it. One day while doing some routine maintenance, I accidentally deleted all of my coworker’s data, setting him behind weeks on his project.