12 Mutating and transforming data frames

This chapter covers

  • Extracting data from ZIP archives
  • Adding and mutating columns of a data frame
  • Performing split-apply-combine transformations of data frames
  • Working with graphs and analyzing their properties
  • Creating complex plots

In chapters 8-11, you learned to create data frames and extract data from them. Now it is time to discuss how data frames can be mutated. By data frame mutation, I mean creating new columns from the data stored in existing columns. For example, given a data frame with a date column, you might want to create a new column that stores the year extracted from each date. In DataFrames.jl, you can achieve this objective in two ways:

  • Update the source data frame in place by adding a new column to it.
  • Create a new data frame storing only the columns that you will later need in your data analysis pipeline.
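
The two options above can be sketched on the date example as follows. This is a minimal illustration rather than the chapter's own code: the toy data and the column names `id` and `date` are assumptions made for this sketch, while `select`, `transform!`, and `ByRow` are standard DataFrames.jl functions and `year` comes from the Dates standard library.

```julia
using DataFrames
using Dates

# Hypothetical toy data; the column names `id` and `date` are made up for illustration.
df = DataFrame(id = 1:3,
               date = [Date(2020, 1, 15), Date(2021, 6, 30), Date(2022, 12, 1)])

# Second option (shown first, while the source is still untouched):
# create a new data frame that keeps only `id` and the derived `year` column.
years_only = select(df, :id, :date => ByRow(year) => :year)

# First option: add the `year` column to the source data frame in place.
transform!(df, :date => ByRow(year) => :year)
```

After running this, `years_only` has the columns `id` and `year`, while `df` has `id`, `date`, and `year`. Writing `df.year = year.(df.date)` is another way to add the column in place.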

This chapter covers both approaches. Data frame mutation is a fundamental step in all data science projects. As discussed in chapter 1, after ingesting the source data, you need to prepare it before it can be analyzed for insights. This data preparation process typically involves tasks such as data cleaning and transformation, which are usually achieved by mutating the columns of a data frame.

12.1 Getting and loading the GitHub developers data set

12.1.1 Understanding graphs

12.1.2 Fetching GitHub developer data from the web

12.1.3 Implementing a function that extracts data from a ZIP file

12.1.4 Reading the GitHub developer data into a data frame

12.2 Computing additional node features

12.2.1 Creating a SimpleGraph object

12.2.2 Computing features of nodes by using the Graphs.jl package

12.2.3 Counting a node’s web and machine learning neighbors

12.3 Using the split-apply-combine approach to predict the developer’s type

12.3.1 Computing summary statistics of web and machine learning developer features