10 Incorporating External Data into Analyses


This chapter covers

  • The value of third-party and external data sources
  • Retrieving and processing data from an API
  • Web scraping and mining unstructured data
  • Tapping into public sources of data

Think of the datasets we’ve used in this book: daily rat sightings in New York City, daily weather information for New York City and Boston, and information about customers, transactions, and production costs. None of these datasets fell out of the sky or was readily available for download. We used multiple approaches to retrieve, structure, and create them and to prepare them for analysis.

Figure 10.1 Each of these datasets was retrieved or constructed for a specific analytical purpose.

Many data sources you access for analysis will be in a raw, unprocessed state, and data teams must put in considerable effort to make them suitable for analysis. Unless your job involves working only with your organization’s highly curated data warehouse, you will eventually need to take part in retrieving and structuring data to derive value from it. Though they are rarely highlighted in the analytics roles you will apply for, data retrieval and structuring are the backbone of any meaningful analysis.

10.1 Leveraging Data from APIs

10.1.1 Retrieving Data: API vs. Browser Interface

10.1.2 Determining the Value of Using an API

10.1.3 Activity

10.2 Web Scraping

10.2.1 Scraping the Web for Data

10.2.2 Extracting the Data We Need

10.2.3 Determining the Value of Web Scraping

10.2.4 Activity

10.3 Tapping into Public Data Sources

10.3.1 When Did Public Data Become So Popular?

10.3.2 Types of Public Data Sources

10.3.3 Accessing Public Data

10.3.4 Activity

10.4 Summary

10.5 References