3 Basic data management

This chapter covers

  • Manipulating dates and missing values
  • Understanding data type conversions
  • Creating and recoding variables
  • Sorting, merging, and subsetting datasets
  • Selecting and dropping variables

In chapter 2, we covered a variety of methods for importing data into R. Unfortunately, getting your data in the rectangular arrangement of a matrix or data frame is only the first step in preparing it for analysis. To paraphrase Captain Kirk in the Star Trek episode “A Taste of Armageddon” (and proving my geekiness once and for all), “Data is a messy business—a very, very messy business.” In my own work, as much as 60% of the time I spend on data analysis is focused on preparing the data for analysis. I’ll go out on a limb and say that the same is probably true in one form or another for most real-world data analysts. Let’s take a look at an example.

3.1   A working example

One of the topics that I study in my current job is how men and women differ in the ways they lead their organizations. Typical questions might be

  • Do men and women in management positions differ in the degree to which they defer to superiors?
  • Does this vary from country to country, or are these gender differences universal?

One way to address these questions is to have bosses in multiple countries rate their managers on deferential behavior, using questions like the following:

3.2   Creating new variables

3.3   Recoding variables

3.4   Renaming variables

3.5   Missing values

3.5.1   Recoding values to missing

3.5.2   Excluding missing values from analyses

3.6   Date values

3.6.1   Converting dates to character variables

3.6.2   Going further

3.7   Type conversions

3.8   Sorting data

3.9   Merging datasets

3.9.1   Adding columns to a data frame

3.9.2   Adding rows to a data frame

3.10  Subsetting datasets

3.10.1    Selecting variables

3.10.2    Dropping variables

3.10.3    Selecting observations

3.10.4    The subset() function

3.10.5    Random samples

3.11  Using dplyr to manipulate data frames

3.11.1    Basic dplyr functions

3.11.2    Using pipe operators to chain statements

3.12  Using SQL statements to manipulate data frames

3.13  Summary
