Data is changing the way businesses and other organizations work. Back in the day, the challenge was getting data; now the challenge is making sense of it, sifting through the noise to find the signal, and providing actionable insights to decision-makers. Those of us who work with data, especially on the frontend—statisticians, data scientists, business analysts, and the like—have many programming languages from which to choose.
R is a go-to programming language with an ever-expanding upside for slicing and dicing large data sets, conducting statistical tests of significance, developing predictive models, producing unsupervised learning algorithms, and creating top-quality visual content. Beginners and professionals alike, up and down an organization and across multiple verticals, rely on the power of R to generate insights that drive purposeful action.
This book provides end-to-end and step-by-step instructions for discovering and generating a series of unique and fascinating insights with R. In fact, this book differs from other manuals you might already be familiar with in several meaningful ways. First, the book is organized by project rather than by technique, which means any and every operation required to start and finish a discrete project is contained within each chapter, from loading packages, to importing and wrangling data, to exploring, visualizing, testing, and modeling data. You’ll learn how to think about, set up, and run a data science or statistics project from beginning to end.
Second, we work exclusively with data sets downloaded or scraped from the web that are available—sometimes for a small fee—to anyone; these data sets were created, of course, without any advance knowledge of how the content might be analyzed. In other words, our data sets are not plug and play. This is actually a good thing because it provides opportunities to introduce a plethora of data-wrangling techniques tied to specific data visualizations and statistical testing methods. Rather than learning these techniques in isolation, you’ll instead learn how seemingly different operations can and must work together.
Third, speaking of data visualizations, you’ll learn how to create professional-grade plots and other visual content—not just bar charts and time-series charts but also dendrograms, Sankey diagrams, pyramid plots, facet plots, Cleveland dot plots, and Lorenz curves, to name just a few visualizations that might be outside the mainstream but are nonetheless more compelling than what you’re probably used to. Often, the most effective way to tell a story or to communicate your results is through pictures rather than words or numbers. You’ll get detailed instructions for creating dozens of plot types and other visual content, some using base R functions, but most from ggplot2, R’s premier graphics package.
Fourth, this book has a professional basketball theme throughout; that’s because all the data sets are, in fact, NBA data sets. The techniques introduced in each chapter aren’t just ends in themselves but also means by which unique and fascinating insights into the NBA are ultimately revealed—all of which are absolutely transferrable to your own professional or academic work. At the end of the day, this book provides a more fun and effective way of learning R and getting further grounded in statistical concepts. With that said, let’s dive in; the following sections provide further background that will best position you to tackle the remainder of the book.

R is an open source and free programming language introduced in 1993 by statisticians for other statisticians. R consistently receives high marks for performing statistical computations (no surprise), producing compelling visualizations, handling massive data sets, and supporting a wide range of supervised and unsupervised learning methods.
In recent years, several integrated development environments (IDEs) have been created for R, combining a source code editor, debugger, and other utilities into a single GUI. By far, the most popular of these is RStudio.
You don’t need RStudio. But imagine going through life without modern conveniences such as running water, microwaves, and dishwashers; that’s R without the benefits of RStudio. And like R, RStudio is a free download. All the code in this book was written in RStudio 1.4.1103 running on top of R 4.1.2 on a Mac laptop computer loaded with version 11.1 of the Big Sur operating system. R and RStudio run just as well on Windows and Linux desktops, by the way.
You should first download and install R (https://cran.r-project.org) and then do the same with RStudio (www.rstudio.com). You’ll indirectly interact with R by downloading libraries, writing scripts, running code, and reviewing outputs directly in RStudio. The RStudio interface is divided into four panels or windows (see figure 1.1). The Script Editor is located in the upper-left quadrant; this is where you import data, install and load libraries (also known as packages), and otherwise write code. Immediately beneath the Script Editor is the Console.
Figure 1.1 A snapshot of the RStudio interface. Code is written in the upper-left panel; programs run in the lower-left panel; the plot window is in the lower-right panel; and a running list of created objects is in the upper-right panel. Through preferences, you can set the background color, font, and font size.

The Console looks and operates like the basic R interface; this is where you review outputs from the Script Editor, including error messages and warnings when applicable. Immediately beside the Console, in the lower-right quadrant of the RStudio interface, is the Plot Window; this is where you view visualizations created in the Script Editor, manipulate their size if you so choose, and export them to Microsoft Word, PowerPoint, or other applications. And then there’s the Environment Window, which keeps a running history of the objects—data frames, tibbles (a tidyverse-specific type of data frame), and visualizations—created inside the Script Editor.
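If you want to see the panels respond in real time, running a few throwaway lines such as these from the Script Editor is enough (this snippet is for demonstration only and isn't used anywhere else in the book):

```r
# Run these lines from the Script Editor (Cmd/Ctrl + Enter) and watch the other panels respond
x <- rnorm(100)   # a vector of 100 random draws; the object x appears in the Environment Window
summary(x)        # summary statistics print to the Console
hist(x)           # a base R histogram renders in the Plot Window
```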
RStudio also runs in the cloud (https://login.rstudio.cloud) and is accessible through almost any web browser. This might be a good option if your local machine is low on resources.

The size of the digital universe is expanding along an exponential curve rather than a straight line; the most successful businesses and organizations are those that collect, store, and make better use of data than their competitors; and, of course, R is, and has been for nearly 30 years, the programming language of choice for statisticians, data scientists, and business analysts around the world. But why should you invest your time polishing your R skills when there are several open source and commercial alternatives?
This book contains some 300 plots. Often, the most effective way of analyzing data is to visualize it, and R is absolutely best in class when it comes to transforming summarized data into professional-looking visual content. So let's first talk about pictures rather than numbers.
Several prepackaged data sets are bundled with the base R installation. This book does not otherwise use any of these objects, but here, the mtcars data set—an object just 32 rows long and 11 columns wide—is more than sufficient to help demonstrate the power of R’s graphics capabilities. The mtcars data was extracted from a 1974 issue of Motor Trend magazine; the data set contains performance and other data on 32 makes and models of automobiles manufactured in the United States, Europe, and Japan.
The following visualizations point to mtcars as a data source (see figure 1.2); they were created with the ggplot2 package and then grouped into a single 2 × 2 matrix with the patchwork package. Both of these packages, especially ggplot2, are used extensively throughout the book. (More on packages in just a moment.)
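The code behind figure 1.2 includes additional formatting, but a minimal sketch of how such a 2 × 2 layout might be assembled looks like the following; the variable choices (weight, cylinders, transmission, gears, and miles per gallon) mirror the plots described next, and everything else is kept to defaults:

```r
library(ggplot2)    # grammar-of-graphics plotting
library(patchwork)  # arranges ggplot objects into a single layout

p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +          # correlation plot
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p2 <- ggplot(mtcars, aes(x = factor(am), y = mpg)) +  # facet plot of boxplots
  geom_boxplot() +
  facet_wrap(~ cyl)

p3 <- ggplot(mtcars, aes(x = factor(gear))) +         # bar chart
  geom_bar()

p4 <- ggplot(mtcars, aes(x = mpg)) +                  # histogram
  geom_histogram(bins = 10)

(p1 + p2) / (p3 + p4)                                 # 2 x 2 matrix via patchwork
```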
Our visualizations include a correlation plot and facet plot along the top and a bar chart and histogram on the bottom, as described here:
- Correlation plot—A correlation plot displays the relationship between a pair of continuous, or numeric, variables. The relationship, or association, between two continuous variables can be positive, negative, or neutral. When positive, the variables move in the same direction; when negative, the two variables move in opposite directions; and when neutral, there is no meaningful relationship at all.
- Facet plot—A facet plot is a group of subplots that share the same horizontal and vertical axes (x-axis and y-axis, respectively) and are otherwise alike. The data is split, or segmented, by groups frequently referred to as factors; a facet plot draws one subplot for each factor and displays each in its own panel. We’ve drawn boxplots to display the distribution of miles per gallon segmented by the number of cylinders and the type of transmission.
- Bar chart—A bar chart, often called a bar graph, uses rectangular bars to display counts of discrete, or categorical, data. Each category, or factor, in the data is represented by its own bar, and the length of each bar corresponds to the value or frequency of the data it represents. The bars are typically displayed vertically, but it’s possible to flip the orientation of a bar chart so that the bars are instead displayed horizontally.
- Histogram—Sometimes mistaken for a bar chart, a histogram is a graphical representation of the distribution of a single continuous variable. It displays the counts, or frequencies, of the data between specified intervals that are usually referred to as bins.
Here’s what these four plots reveal:
- There is a strong negative correlation, equal to -0.87, between miles per gallon and weight; that is, heavier automobiles get fewer miles to the gallon than lighter automobiles. The correlation coefficient, which is computed on a scale from -1 to +1, indicates how strongly, or not so strongly, two variables such as miles per gallon and weight are related; the downward slope of the regression line reflects the negative direction of that relationship.
- Automobiles with fewer cylinders get more miles to the gallon than cars with more cylinders. Furthermore, especially regarding automobiles with either four or six cylinders, those with manual transmissions get more miles to the gallon than those with automatic transmissions.
- There is a significant difference in miles per gallon depending upon the number of forward gears an automobile has; for instance, automobiles with four forward gears get 8 miles to the gallon more than automobiles equipped with just three forward gears.
- The miles per gallon distribution of the 32 makes and models in the mtcars data set appears to be normal (think of a bell-shaped curve in which most of the data is concentrated around the mean, or average); however, there are more automobiles that get approximately 20 miles to the gallon or less than there are otherwise. The Toyota Corolla gets the highest miles per gallon, whereas the Cadillac Fleetwood and Lincoln Continental are tied for getting the lowest miles per gallon.
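If you'd like to verify these observations yourself, a few base R calls against mtcars reproduce the headline numbers:

```r
cor(mtcars$mpg, mtcars$wt)                            # correlation between mpg and weight (about -0.87)
aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)  # mean mpg by cylinders and transmission type
aggregate(mpg ~ gear, data = mtcars, FUN = mean)      # mean mpg by number of forward gears
mtcars[which.max(mtcars$mpg), "mpg", drop = FALSE]    # best mpg (the Toyota Corolla)
mtcars[which.min(mtcars$mpg), "mpg", drop = FALSE]    # worst mpg (first of the two tied automobiles)
```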
R’s reputation in the data visualization space is due to the quantity of graphs, charts, plots, diagrams, and maps that can be created and the quality of their aesthetics; it isn’t at all due to ease of use. R, and specifically the ggplot2 package, gives you the power and flexibility to customize any visual object and to apply best practices. But with customizations come complexities, such as the following:
- Concerning the facet plot, for instance, where paired boxplots were created and divided by the number of cylinders in an automobile’s engine, an additional function—with six arguments—was called just to create white dots to represent the population means (ggplot2 otherwise prints a horizontal line inside a boxplot to designate the median). Another function was called so that ggplot2 returned x-axis labels that spelled out the transmission types rather than a 0 for automatic and a 1 for manual.
- The bar chart, a relatively straightforward visual object, nevertheless contains several customizations. Data labels aren’t available out of the box; adding them required calling another function plus decision points on their font size and location. And because those data labels were added atop each bar, it then became necessary to extend the length of the y-axis, thereby requiring yet another line of code.
- When you create a histogram, ggplot2 does not automatically return a plot with an ideal number of bins; instead, that’s your responsibility to figure out, and this usually requires some experimentation. In addition, the tick marks along the y-axis were hardcoded so that they included whole numbers only; by default, ggplot2 returns fractional numbers for half of the tick marks, which, of course, makes no sense for histograms.
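None of this is the exact code behind figure 1.2, but the following sketch, again drawn against mtcars with otherwise default settings, illustrates the kinds of customizations just described: white dots for group means, data labels atop bars with a lengthened y-axis, and an explicit bin count with whole-number y-axis breaks.

```r
library(ggplot2)

# Boxplots with white dots marking the group means (ggplot2 only draws the median line by default)
ggplot(mtcars, aes(x = factor(am, labels = c("Automatic", "Manual")), y = mpg)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 21, size = 3,
               color = "black", fill = "white") +
  facet_wrap(~ cyl) +
  labs(x = "Transmission", y = "Miles per gallon")

# A bar chart with data labels added atop each bar and a lengthened y-axis to make room for them
ggplot(mtcars, aes(x = factor(gear))) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3) +
  ylim(0, 18) +
  labs(x = "Forward gears", y = "Count")

# A histogram where the bin count and whole-number y-axis breaks are set explicitly
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, color = "white") +
  scale_y_continuous(breaks = seq(0, 10, by = 2)) +
  labs(x = "Miles per gallon", y = "Count")
```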
This book provides step-by-step instructions on how to create these and some three dozen other types of ggplot2 visualizations that meet the highest standards for aesthetics and contain just enough bells and whistles to communicate clear and compelling messages.
Regardless of what sort of operation you want or need to perform, there’s a great chance that other programmers preceded you. There’s also a good chance that one of those programmers then wrote an R function, bundled it into a package, and made it readily available for you and others to download. R’s library of packages continues to expand rapidly, thanks to programmers around the world who routinely make use of R’s open source platform. In a nutshell, programmers bundle their source code, data, and documentation into packages and then upload their final products into a central repository for the rest of us to download and use.
As of this writing, there are 19,305 packages stored in the Comprehensive R Archive Network (CRAN). Approximately one-third of these were published in 2022; another one-third were published between 2019 and 2021; and the remaining one-third were published sometime between 2008 and 2018. The ggplot2 bar chart shown in figure 1.3 reveals the number of packages available in CRAN by publication year. (Note that the number of packages available is different from the number of packages published because many have since been deprecated.) The white-boxed labels affixed inside the bars represent the percentage of the total package count as of March 2023; so, for instance, of all the packages published in 2021, 3,105 remain in CRAN, which represents 16% of the total package count.
Clearly, new packages are being released at an increasing rate; in fact, the 2023 count of new packages is on pace to approach or even exceed 12,000. That’s about 33 new packages on average every day. R-bloggers, a popular website with hundreds of tutorials, publishes a Top 40 list of new packages every month, just to help programmers sift through all the new content. These are the kinds of numbers that surely make heads spin in the commercial software world.
Packages are super easy to install: it takes just a single line of code or a couple of clicks inside the RStudio GUI to install one. This book will show you how to install a package, how to load a package into your script, and how to utilize some of the most powerful packages now available.
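Using ggplot2 as an example, that single line of code looks like this; the install step needs to be run only once per machine, whereas the library() call belongs in every script that uses the package:

```r
install.packages("ggplot2")   # downloads the package and its dependencies from CRAN; run once
library(ggplot2)              # loads the installed package into the current R session
```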
R programmers are very active online, seeking support and getting it. The flurry of online activity helps you correct errors in your code, overcome other roadblocks, and be more productive. A series of searches on Stack Overflow, a website where statisticians, data scientists, and other programmers congregate for technical support, returned almost 450,000 hits for R versus just a fraction of that total, about 20%, for five leading commercial alternatives (JMP, MATLAB, Minitab, SAS, and SPSS) combined.
In the spirit of full disclosure, Python, another open source programming language, returned more hits than R—way more, in fact. But bear in mind that Python, while frequently used for data science and statistical computing, is really a general programming language, also used to develop application interfaces, web portals, and even video games; R, on the other hand, is strictly for number crunching and data analysis. So comparing R to Python is very much like comparing apples to oranges.
If you want or anticipate the need to interact with a typical big data technology stack (e.g., Hadoop for storage, Apache Kafka for ingestion, Apache Spark for processing), R is one of your best bets for the analytics layer. In fact, the top 10 results from a Google search on “best programming languages for big data” all list R as a top choice, while the commercial platforms previously referenced, minus MATLAB, weren’t mentioned at all.
There’s a healthy job market for R programmers. An Indeed search returned nearly 19,000 job opportunities for R programmers in the United States, more than SAS, Minitab, SPSS, and JMP combined. It’s a snapshot in time within one country, but the point nevertheless remains. (Note that many of the SAS and SPSS job opportunities are at SAS or IBM.) A subset of these opportunities was posted by some of the world’s leading technology companies, including Amazon, Apple, Google, and Meta (Facebook’s parent company). The ggplot2 bar chart shown in figure 1.4 visualizes the full results. Python job opportunities, of which there are plenty, aren’t included for the reason mentioned previously.

As previously mentioned, this book is organized so that each of the following chapters is a standalone project—minus the final chapter, which is a summary of the entire book. That means every operation required to execute a project from end to end is self-contained within each chapter. The following flow diagram, or process map, provides a visual snapshot of what you can expect going forward (see figure 1.5).
Figure 1.5 A typical chapter flow and, not coincidentally, the typical end-to-end flow of most real-world data science and statistics projects

We use only base R functions—that is, out-of-the-box functions that are immediately available to you after completing the R and RStudio installations—to load packages into our scripts. After all, you can’t put a cart before a horse, and you can’t call a packaged function without first installing and loading the package. Thereafter, we rely on a mix of built-in and packaged functions, with a strong lean toward the latter, especially for preparing and wrangling our data sets and creating visual content of the same.
We begin every chapter with some hypothesis. It might be a null hypothesis that we subsequently reject or fail to reject depending on test results. In chapter 7, for instance, our going-in hypothesis is that any variances in personal fouls and attempted free throws between home and visiting teams are due to chance. We then reject that hypothesis and assume officiating bias if our statistical tests of significance return a low probability of ever obtaining equal or more extreme results; otherwise, we fail to reject that same hypothesis. Or it might merely be an assumption that must then be confirmed or denied by applying other methods. Take chapter 15, for instance, where we assume nonlinearity between the number of NBA franchises and the number of games played and won, and then create Pareto charts, visual displays of unit and cumulative frequencies, to present the results. For another example, take chapter 19, where we make the assumption that standardizing points-per-game averages by season—that is, converting the raw data to a common and simple scale—would most certainly provide a very different historical perspective on the NBA’s top scorers.
Then, we start writing our scripts. We begin every script by loading our required packages, usually by making one or more calls to the library() function. Packages must be installed before they are loaded, and they must be loaded before their functions are called. Thus, there’s no hard requirement to preface any R script by loading any package; they can instead be loaded incrementally if that’s your preference. But think of our hypothesis as the strategic plan and the packages as representing part of the tactical, or short-term, steps that help us achieve our larger goals. That we choose to load our packages up front reflects the fact that we’ve thoughtfully blueprinted the details on how to get from a starting line to the finish line.
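The exact mix of packages changes from chapter to chapter, but a typical preamble looks something like this sketch:

```r
library(tidyverse)   # loads dplyr, tidyr, readr, ggplot2, and other tidyverse packages in one call
library(patchwork)   # arranges multiple ggplot2 objects into a single figure
```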
Next, we import our data set, or data sets, by calling the read_csv() function from the readr package, which, like ggplot2, is part of the tidyverse universe of packages. That’s because all of our data sets are .csv files downloaded from public websites or created from scraped data that was then copied into Microsoft Excel and saved with a .csv extension.
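A typical import, then, looks something like the following; the file name here is hypothetical and stands in for whichever .csv file a given chapter actually uses:

```r
library(readr)   # read_csv() ships with the readr package, part of the tidyverse

draft <- read_csv("draft.csv")   # hypothetical file name, for illustration only
```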
This book demonstrates how to perform almost any data-wrangling operation you’ll ever need, usually by calling dplyr and tidyr functions, which are also part of the tidyverse. You’ll learn how to transform, or reshape, data sets; subset your data by rows or columns; summarize data, by groups when necessary; create new variables; and join multiple data sets into one.
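As a brief preview, here is what a few of those dplyr verbs look like when run against the familiar mtcars data rather than one of the book's NBA data sets:

```r
library(dplyr)

mtcars %>%
  filter(cyl > 4) %>%                                 # subset rows
  select(mpg, cyl, wt, am) %>%                        # subset columns
  mutate(wt_lbs = wt * 1000) %>%                      # create a new variable (wt is recorded in 1,000-lb units)
  group_by(cyl, am) %>%                               # group the data ...
  summarize(mean_mpg = mean(mpg), .groups = "drop")   # ... and summarize it
```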
This book also demonstrates how to apply best exploratory data analysis (EDA) practices. EDA is an initial but thorough interrogation of a data set, usually by mixing computations of basic statistics with correlation plots, histograms, and other visual content. It’s always a good practice to become intimately familiar with your data after you’ve wrangled it and before you test it or otherwise analyze it. We mostly call base R functions to compute basic statistical measures such as means and medians; however, we almost exclusively rely on ggplot2 functions and even ggplot2 extensions to create best-in-class visualizations.
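Again using mtcars as a stand-in, a first pass at EDA might mix a couple of base R summaries with a quick ggplot2 visualization:

```r
# Basic statistical measures with base R
summary(mtcars[, c("mpg", "wt", "hp")])      # minimum, quartiles, mean, and maximum
sapply(mtcars[, c("mpg", "wt", "hp")], sd)   # standard deviations

# A quick look at the mpg distribution with ggplot2
library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
  geom_density()
```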
We then test or at least further analyze our data. For instance, in chapter 5, we develop linear regression and decision tree models to isolate which hustle statistics—loose balls recovered, passes deflected, shots defended, and the like—have a statistically significant effect on wins and losses. In chapter 9, we run a chi-square test for independence, a type of statistical or hypothesis test run against two categorical variables, to determine whether permutations of prior days off between opposing home and road teams help decide who wins. Alternatively, let’s consider chapter 3, where we develop a type of unsupervised learning algorithm called hierarchical clustering to establish whether teams should have very different career expectations of a top-five draft pick versus any other first-round selection. Or take chapter 16, where we evaluate the so-called hot hand phenomenon by “merely” applying some hard-core analysis techniques, minus any formal testing.
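As a small taste of what a statistical test looks like in R, here is a chi-square test for independence run on a made-up contingency table; the counts are invented for illustration and have nothing to do with the actual chapter 9 data:

```r
# Made-up counts: rows are a home team's prior days off, columns are game outcomes
toy_table <- matrix(c(30, 20,
                      45, 25,
                      60, 20),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(days_off = c("0", "1", "2+"),
                                    result   = c("Win", "Loss")))

chisq.test(toy_table)   # a small p-value suggests days off and game outcomes are not independent
```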
Finally, we present our conclusions that tie back to our hypothesis: yes (or no), officials are biased toward home teams; yes (or no), rest matters in wins and losses; yes (or no), defense does, in fact, win championships. Often, our conclusions are actionable, and therefore, they naturally mutate into a series of recommendations. If some hustle statistics matter more than others, then teams should coach to those metrics; if teams want to bolster their rosters through the amateur draft, and if it makes sense to tank, or purposely lose games, as a means of moving up the draft board to select the best available players, then that’s exactly what teams should do; offenses should be designed around the probabilities of scoring within a 24-second shot clock.
Before jumping into the rest of the book, here are some caveats and other notes to consider. First, some chapters don’t flow quite so sequentially, with clear delineations between, let’s say, data wrangling and EDA. Data-wrangling operations may be required throughout; it might be necessary to prep a data set before exploring its contents, and further wrangling might then be required to create visualizations. Likewise, conclusions aren’t always held in reserve and revealed only at the end of a chapter. In addition, chapter 3 is more or less a continuation of chapter 2, and chapter 11 is a continuation of chapter 10; these breaks are meant to keep chapter lengths to a reasonable number of pages. The same flow, or process, applies throughout, and you’ll learn just as much in chapter 2 as in chapter 3, and just as much in chapter 10 as in chapter 11. We’ll get started by exploring a data set of first-round draft picks and their subsequent career trajectories.
- R is a programming language developed by statisticians for statisticians; it’s a programming language for, and only for, crunching numbers and analyzing data.
- RStudio is a GUI or IDE that controls an R session. Installing and loading packages, writing code, viewing and analyzing results, troubleshooting errors, and producing professional-quality reports are tasks made much easier with RStudio.
- Against many competing alternatives—open source and commercial—R remains a best-in-class solution with regard to performing statistical computations, creating elegant visual content, managing large and complex data sets, creating regression models and applying other supervised learning methods, and conducting segmentation analysis and other types of unsupervised learning. As an R programmer, you’ll be bounded only by the limits of your imagination.
- R functionality is, and has been, on a skyrocketing trajectory. Packages extend R’s functional footprint, and over half of the packages now available in CRAN were developed within the past three years. Next-generation programmers—studying at Northwestern, Berkeley, or some other college or university where the curriculum is naturally focused on open source and free technologies—are likely to maintain R’s current trajectory for the foreseeable future.
- There’s no 1-800 number to call for technical support, but there are Stack Overflow, GitHub, and other similar websites where you can interact with other R programmers and get solutions, which beats asking a level-1 analyst to open a support ticket any day of the week.
- R is one of the programming languages that make interacting with big data technologies user-friendly.
- There’s a high demand for R programmers in today’s marketplace. An ongoing symbiotic relationship between higher education and private industry has created a virtuous circle of R-based curricula and R jobs that is likely to self-perpetuate in the years to come.