Storm Applied: Strategies for real-time event processing

Foreword

 

“Backend rewrites are always hard.”

That’s how ours began, with a simple statement from my brilliant and trusted colleague, Keith Bourgoin. We had been working on the original web analytics backend behind Parse.ly for over a year. We called it “PTrack”.

Parse.ly uses Python, so we built our systems atop the distributed computing tools that were popular in that community, such as multiprocessing and Celery. Despite our mastery of these, it seemed like every three months we'd double the amount of traffic we had to handle and hit some new limitation of those systems. There had to be a better way.

So, we started the much-feared backend rewrite. This new scheme to process our data would use small Python processes that communicated via ZeroMQ. We jokingly called it "PTrack3000," referring to the "Python 3000" name the language's creator had given to a future version of Python back when it was still a far-off pipe dream.

By using ZeroMQ, we thought we could squeeze more messages per second out of each process and keep the system operationally simple. But what this setup gained in operational ease and performance, it lost in data reliability.

Then, something magical happened. BackType, a startup whose progress we had tracked in the popular press,[1] was acquired by Twitter. One of BackType's first orders of business after the acquisition was to publicly release its stream processing framework, Storm, to the world.