What are the most fundamental building blocks of all data science projects? First, by definition, data science projects use data: every machine learning and data science project needs at least a small amount of it. Second, the science part of data science implies that we don’t merely collect data but use it to compute something. Correspondingly, data and compute are the two most foundational layers of our data science infrastructure stack, depicted in figure 4.1.
Managing and accessing data is such a deep and broad topic that we postpone an in-depth discussion of it until chapter 7. In this chapter, we focus on the compute layer of the stack, which answers a seemingly simple question: After a data scientist has defined a piece of code, such as a step in a workflow, where should we execute it?