The idea for this book came to me in the summer of 2018, after working with some especially talented developers who had managed to go a significant portion of their careers without learning how to write scalable code. I realized then that a lot of the techniques for “big data” work, or what we’ll refer to in this book as “large dataset” problems, are reserved for those who want to tackle these problems exclusively. Because a lot of these problems occur in enterprise environments, where the mechanisms to produce data at this scale are already in place, books about this topic tend to be written in the same enterprise languages as the tools, such as Java.
This book is a little different. I’ve noticed that large dataset problems are increasingly being tackled in a distributed manner. Not distributed in the sense of distributed computing (though certainly that as well), but distributed in terms of who’s doing the work. Individual developers or small development teams, often working in rapid prototyping environments or with rapid development languages (such as Python), are now working with large datasets.