1 Software engineering principles

 

This chapter covers

  • What data scientists need to know about software engineering
  • Components of a data pipeline
  • Deploying models with machine learning pipelines

Suppose you’re collaborating on a software project with a team that includes data scientists, software engineers, and other technical and non-technical roles. How do you handle several people modifying the same code files? What about testing new features or modeling techniques? What’s the best way to track these experiments or to revert changes? Software engineers and data scientists may have very different answers to these questions.

Data scientists frequently use tools like Jupyter Notebook, which lets you write code and view its results in a single integrated environment. Jupyter notebooks are easy to create, use, and share, in particular because they let you show charts or other visuals alongside the related code, data, and text descriptions. That fits the work well, because much of data science is experimentation and exploration: trying out various ideas, creating visualizations, and searching for answers in data. However, as projects become more complex and involve more contributors, these notebook files can get messy very quickly.

1.1 What is software engineering?

1.2 What do data scientists need to know about software engineering?

1.2.1 Better structured code

1.2.2 Improving coding collaboration

1.2.3 Scaling your code to handle more data efficiently

1.2.4 Effectively testing code to reduce future issues

1.2.5 Putting models into production to make them usable by others

1.2.6 Recap of engineering principles

1.2.7 Sample data science workflow

1.2.8 How does software engineering come into the picture?

1.3 Components of a data pipeline

1.3.1 Real-world example: Building a model to predict customer churn

1.4 Deploying models with machine learning pipelines

1.4.1 Data ingestion

1.4.2 Pre-processing

1.4.3 Model training

1.4.4 Model evaluation

1.4.5 Model prediction / deployment

1.4.6 Model monitoring

1.5 Summary