Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code

About this book


Who should read this book

The goal of this book is to teach a scalable style of programming. To do that, we’ll cover a wider range of material than you might be used to from other programming or technology books. Where other books might cover a single library, this book covers many: built-in modules, such as functools and itertools, and third-party libraries, such as toolz, pathos, and mrjob. Where other books cover just one technology, this book covers several, including Hadoop, Spark, and Amazon Web Services (AWS). I chose to cover this broad range because scaling your code requires being able to adapt to new situations. Across all of these technologies, however, I emphasize a “map and reduce” style of programming in Python.

You’ll find that this style stays constant even as the environment your code runs in changes, which is why I adopted it in the first place. You can use it to adapt your code rapidly to new situations. Ultimately, the book aims to teach you how to scale your code by writing it in a map and reduce style. Along the way, I also aim to teach you the tools of the trade for big data work, such as Spark, Hadoop, and AWS.
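To give a rough sense of what a map and reduce style can look like, here is a minimal sketch that uses only Python's built-in map and functools.reduce. The sample data and the helper names word_count and add are invented for illustration and are not taken from the book.

```python
from functools import reduce

def word_count(line):
    """Map step: turn one line of text into a number (its word count)."""
    return len(line.split())

def add(total, count):
    """Reduce step: fold one mapped value into a running total."""
    return total + count

# Hypothetical sample data, made up for this sketch.
lines = [
    "map transforms each item",
    "reduce combines the results",
    "together they scale",
]

# map applies word_count to every line independently and lazily.
counts = map(word_count, lines)

# reduce combines the mapped values into a single result, starting from 0.
total_words = reduce(add, counts, 0)
print(total_words)
```

Because the map step treats every item independently, the same word_count function could in principle be handed to a parallel map, such as multiprocessing.Pool.map, without being rewritten.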

How this book is organized: A roadmap

About the code

liveBook discussion forum

About the author

About the cover illustration