The past five chapters covered how to take data science projects from prototype to production. We have learned how to build workflows, use them to run computationally demanding tasks in the cloud, and deploy the workflows to a production scheduler. Now that we have a crisp idea of the prototyping loop and its interaction with production deployments, we can return to the fundamental question: how should workflows consume and produce data?
Interfacing with data is a key concern of all data science applications. Every application needs to find and read input data that is stored somewhere, and often it must also write its outputs, such as fresh predictions, back to the same system. Although systems for managing data vary enormously, in this context we use a common moniker, the data warehouse, to refer to all of them. Given the foundational nature of data inputs and outputs, it feels appropriate to place this concern at the very bottom of the stack, as depicted in figure 7.1.