About this book

Who should read this book

Data Science with Python and Dask takes you on a hands-on journey through a typical data science workflow—from data cleaning through deployment—using Dask. The book begins by presenting some foundational knowledge of scalable computing and explains how Dask takes advantage of those concepts to operate on datasets big and small. Building on that foundation, it then turns its focus to preparing, analyzing, visualizing, and modeling various real-world datasets to give you tangible examples of how to use Dask to perform common data science tasks. Finally, the book ends with a step-by-step walkthrough of deploying your very own Dask cluster on AWS to scale out your analysis code.

Data Science with Python and Dask was written primarily with beginner to intermediate data scientists, data engineers, and analysts in mind, specifically those who have not yet mastered working with datasets that push the limits of a single machine. Prior experience with other distributed frameworks (such as PySpark) is not necessary, but readers who have it can still benefit from this book by comparing Dask’s capabilities and ergonomics with those of the frameworks they already know. Various articles and documentation are available online, but none focus on using Dask for data science as comprehensively as this book does.

How this book is organized: A roadmap

This book has 11 chapters organized into three parts.

Part 1 lays some foundational knowledge about scalable computing and provides a few simple examples of how Dask uses these concepts to scale out workloads.

  • Chapter 1 introduces Dask by building a case for why it’s an important tool to have in your data science toolkit. It also introduces and explains directed acyclic graphs (DAGs), a core concept for scalable computing that’s central to Dask’s architecture.
  • Chapter 2 ties what you learned conceptually about DAGs in chapter 1 to how Dask uses DAGs to distribute work across multiple CPU cores and even physical machines. It goes over how to visualize the DAGs automatically generated by the task scheduler, and how the task scheduler divides up resources to efficiently process data.

Part 2 covers common data cleaning, analysis, and visualization tasks with structured data using the Dask DataFrame construct.

  • Chapter 3 describes the conceptual design of Dask DataFrames and how they abstract and parallelize Pandas DataFrames.
  • Chapter 4 discusses how to create Dask DataFrames from various data sources and formats, such as text files, databases, S3, and Parquet files.
  • Chapter 5 is a deep dive into using DataFrames to clean and transform datasets. It covers sorting, filtering, dealing with missing values, joining datasets, and writing DataFrames in several file formats.
  • Chapter 6 covers using built-in aggregate functions (such as sum, mean, and so on), as well as writing your own aggregate and window functions. It also discusses how to produce basic descriptive statistics.
  • Chapter 7 steps through creating basic visualizations, such as pairplots and heatmaps.
  • Chapter 8 builds on chapter 7 and covers advanced visualizations with interactivity and geographic features.

Part 3 covers advanced topics in Dask, such as unstructured data, machine learning, and building scalable workloads.

  • Chapter 9 demonstrates how to parse, clean, and analyze unstructured data using Dask Bags and Arrays.
  • Chapter 10 shows how to build machine learning models from Dask data sources, as well as how to test and persist trained models.
  • Chapter 11 completes the book by walking through how to set up a Dask cluster on AWS using Docker.

You can read the book sequentially if you prefer a step-by-step narrative, or skip around if you are interested in learning how to perform specific tasks. Either way, you should read chapters 1 and 2 to form a good understanding of how Dask scales workloads out from multiple CPU cores to multiple machines. You should also refer to the appendix for specific information on setting up Dask and some of the other packages used in the text.

About the code

A primary way this book teaches the material is by providing hands-on examples on real-world datasets. As such, there are many numbered code listings throughout the book. While there is no code in line with the rest of the text, at times a variable or method name that appears in a numbered code listing is referenced for explanation. These are differentiated by using this text style wherever references are made. Many code listings also contain annotations to further explain what the code means.

All the code is available in Jupyter Notebooks and can be downloaded at www.manning.com/books/data-science-at-scale-with-python-and-dask. Each notebook cell relates to one of the numbered code listings and is presented in order of how the listings appear in the book.

liveBook discussion forum

Purchase of Data Science with Python and Dask includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum, go to https://livebook.manning.com/book/data-science-with-python-and-dask. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/#!/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
