7 Processing data

 

This chapter covers

  • Accessing large amounts of cloud-based data quickly
  • Using Apache Arrow for efficient, in-memory data processing
  • Leveraging SQL-based query engines to preprocess data for workflows
  • Encoding features for models at scale

The past five chapters covered how to take data science projects from prototype to production. We learned how to build workflows, use them to run computationally demanding tasks in the cloud, and deploy them to a production scheduler. Now that we have a crisp idea of the prototyping loop and of how it interacts with production deployments, we can return to the fundamental question: how should workflows consume and produce data?

Interfacing with data is a key concern of all data science applications. Every application needs to find and read input data that is stored somewhere. Often, the application is required to write its outputs, such as fresh predictions, to the same system. Although a huge amount of variation exists among systems for managing data, in this context we use a common moniker, data warehouse, to refer to all of them. Given the foundational nature of data inputs and outputs, it feels appropriate to place the concern at the very bottom of the stack, as depicted in figure 7.1.
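As a concrete illustration of this read-process-write pattern, the following sketch reads a Parquet dataset from S3 with pyarrow, scores it, and writes fresh predictions back to the same bucket. The bucket name, object paths, and the predict function are hypothetical placeholders used only to show the shape of the loop; they are not the chapter's own examples.

# A minimal sketch of the read-process-write loop against a data warehouse
# backed by S3. Bucket, paths, and predict() are hypothetical placeholders.
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

def predict(df):
    # Placeholder model: score every row with a constant.
    df = df.copy()
    df["prediction"] = 0.5
    return df

# Assumes AWS credentials are configured in the environment.
s3 = fs.S3FileSystem(region="us-east-1")

# Read input data stored in the warehouse (here, a Parquet file on S3).
table = pq.read_table("my-bucket/input/events.parquet", filesystem=s3)

# Process the data in memory and write the outputs back to the same system.
scored = predict(table.to_pandas())
pq.write_table(pa.Table.from_pandas(scored),
               "my-bucket/output/predictions.parquet",
               filesystem=s3)

Later sections of this chapter look at each stage of this loop in more detail: loading data from S3 quickly, processing it in memory with Apache Arrow, and pushing preparation work down to SQL-based query engines.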

Figure 7.1 The stack of effective data science infrastructure

7.1 Foundations of fast data

7.1.1 Loading data from S3

7.1.2 Working with tabular data

7.1.3 The in-memory data stack

7.2 Interfacing with data infrastructure

7.2.1 Modern data infrastructure

7.2.2 Preparing datasets in SQL

7.2.3 Distributed data processing

7.3 From data to features

7.3.1 Distinguishing facts and features

7.3.2 Encoding features
