2 Source control for data scientists

This chapter covers

  • What source control is
  • Git as a tool for source control
  • Git workflow from scratch
  • Handling conflicts and merges with Git
  • Comparing Jupyter Notebook files with nbdime

In the last chapter, we introduced several key software engineering concepts that will improve your life as a data scientist. One of them is source control, the focus of this chapter. Source control (also called version control) is a way of tracking changes made to a codebase. As the number and size of codebases have grown over the years, it has become crucial to monitor code changes and to make collaboration among developers easier. Because software engineering has existed longer than modern data science, source control has been a software engineering practice for longer than it has been a data science one. However, as we’ll demonstrate in this chapter, source control is an important tool for any data scientist to learn.

2.1 What is source control?

2.2 Introducing Git

2.2.1 Basic Git commands

2.3 Git workflow from scratch

2.3.1 Uploading local repository changes to a remote repository

2.3.2 Modifying a Git repository

2.3.3 How to see who made commits

2.3.4 Getting the latest changes from a remote repository

2.4 Conflicts and merging changes from different users

2.4.1 Conflict example

2.4.2 Keeping remote changes

2.4.3 Keeping local changes

2.5 How to work with branches in Git

2.5.1 Git commands for branches

2.6 Summary of Git workflow

2.7 Best practices for using source control

2.8 Comparing Jupyter Notebook files with nbdime

2.8.1 Using the nbdime package

2.9 Practice on your own

2.10 Summary