10 Incorporating external data into analyses


This chapter covers

  • The value of third-party and external data sources
  • Retrieving and processing data from an API
  • Web scraping and mining unstructured data
  • Tapping into public sources of data

Think of the datasets we’ve used in this book. We’ve worked with several sources of information, often combining some of them into a single resource we can use to answer questions. Most notably:

  • We explored a dataset containing the number of reported rat sightings in New York City.
  • We tracked historical weather information in New York City and Boston.
  • We analyzed customer login and transaction data for various hypothetical companies.

None of these datasets fell out of the sky or were readily available to download in the exact format needed to cover each topic in this book. We used multiple approaches to retrieve, structure, and assemble these datasets before they were ready for analysis.

This chapter will delve into common methods for retrieving data from sources such as APIs, websites, and public databases. We’ll explore the formats in which this data is typically delivered, so you can extract the information relevant to your analytical needs. Beyond running the associated Python code, we’ll focus on the mindset an analyst needs to strategically seek out information that enriches the data already available at their organization.
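
As a quick preview of the kind of workflow we’ll walk through in section 10.1, the sketch below retrieves JSON from an API with the requests library and loads the records into a pandas DataFrame. The endpoint URL, query parameters, and field names here are placeholders for illustration, not a real service; the sections that follow work through real sources step by step.

import requests
import pandas as pd

# Hypothetical endpoint -- substitute a real API URL and any required parameters or keys
API_URL = "https://api.example.com/v1/rat-sightings"

# Request the data; most APIs return JSON
response = requests.get(API_URL, params={"borough": "Manhattan", "limit": 100}, timeout=30)
response.raise_for_status()  # surface HTTP errors instead of failing silently

# Convert the list of JSON records into a tabular structure for analysis
records = response.json()
sightings = pd.DataFrame(records)
print(sightings.head())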

10.1 Using APIs

10.1.1 Retrieving data: API vs. browser interface

10.1.2 Determining the value of using an API

10.1.3 Exercises

10.2 Web scraping

10.2.1 Scraping the web for data

10.2.2 Extracting the data we need

10.2.3 Determining the value of web scraping

10.2.4 Exercises

10.3 Tapping into public data sources

10.3.1 When did public data become so popular?

10.3.2 Types of public data sources

10.3.3 Accessing public data

10.3.4 Exercises

Summary