5 Data engineering and data shaping

This chapter covers

Becoming comfortable with applying data transforms
Starting with important data manipulation packages including data.table and dplyr
Learning to control the layout of your data

This chapter will show you how to use R to organize, or wrangle, data into a shape useful for analysis. Data shaping is the set of steps you have to take when your data is not all in one table, or is not arranged in a form ready for analysis.

Figure 5.1 is the mental model for this chapter: working with data. Previous chapters assumed that the data was in a ready-to-go form, or we prepared the data in such a form for you. This chapter will prepare you to take these steps yourself. The basic concept of data wrangling is to visualize the structure that would make your task easier, and then take steps to give your data that structure. To teach this, we'll work through a number of examples, each starting with a motivating task followed by a transform that solves the problem. We'll concentrate on a set of transforms that are powerful and useful, and that cover most common situations.

Figure 5.1. Chapter 5 mental model

We will show data wrangling solutions using base R, data.table, and dplyr.^[1] Each of these systems has its own strengths, which is why we present more than one solution. Throughout this book, we deliberately take a polyglot approach to data wrangling, mixing base R, data.table, and dplyr as convenient.
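As a rough preview of how the three systems compare, here is a minimal sketch of one small task (keep the rows with a large value, and only two of the columns) written three ways. The data frame d and its column names are hypothetical, invented only for this illustration; the sketch assumes the data.table and dplyr packages are installed.

```r
# Hypothetical toy data, invented for illustration only
d <- data.frame(
  id = 1:5,
  group = c("a", "b", "a", "b", "a"),
  value = c(10, 25, 7, 40, 18),
  stringsAsFactors = FALSE
)

# Base R: a logical row index plus a vector of column names
res_base <- d[d$value > 15, c("id", "value"), drop = FALSE]

# data.table: row condition in the first slot, columns in the second
library(data.table)
dt <- as.data.table(d)
res_dt <- dt[value > 15, .(id, value)]

# dplyr: one verb per step, chained with the pipe
library(dplyr)
res_dplyr <- d %>%
  filter(value > 15) %>%
  select(id, value)

# All three keep the same rows (ids 2, 4, and 5)
print(res_base$id)
print(res_dt$id)
print(res_dplyr$id)
```

The three results hold the same rows and columns but differ in class (a data.frame for base R and dplyr here, a data.table for data.table), which is worth keeping in mind when mixing the systems.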

5.1. Data selection

5.1.1. Subsetting rows and columns

5.1.2. Removing records with incomplete data

5.1.3. Ordering rows

5.2. Basic data transforms

5.2.1. Adding new columns

5.2.2. Other simple operations

5.3. Aggregating transforms

5.3.1. Combining many rows into summary rows

5.4. Multitable data transforms