2 The toolchain of data science

 

This chapter covers

  • The key activities that the data scientist engages in on a daily basis
  • The essential toolchain that makes the data scientist productive
  • The role of workflows in the infrastructure stack

Every profession has its tools of the trade. If you are a carpenter, you need saws, rulers, and chisels. If you are a dentist, you need mirrors, drills, and syringes. If you are a data scientist, what are the essential tools that you need in your daily job?

Obviously, you need a computer. But what’s the purpose of the computer? Should it run the heavy computation itself, such as training models, or should it be just a relatively dumb terminal for typing code and analyzing results? Because production applications execute outside personal laptops, maybe prototyping should happen as close to the real production environment as possible, too. Answering questions like these can be surprisingly nontrivial, and the answers can have deep implications for the whole infrastructure stack.

2.1 Setting up a development environment

2.1.1 Cloud account

2.1.2 Data science workstation

2.1.3 Notebooks

2.1.4 Putting everything together

2.2 Introducing workflows

2.2.1 The basics of workflows

2.2.2 Executing workflows

2.2.3 The world of workflow frameworks

Summary