12 Case study 3 solution

 

This section covers

  • Extracting and visualizing locations
  • Cleaning data
  • Clustering locations

Our goal is to extract locations from disease-related headlines to uncover the largest active epidemics within and outside of the United States. We will proceed as follows:

  1. Load the data.
  2. Extract locations from the text using regular expressions and the GeoNamesCache library.
  3. Check the location matches for errors.
  4. Cluster the locations based on geographic distance.
  5. Visualize the clusters on a map, and remove any errors.
  6. Output representative locations from the largest clusters to draw interesting conclusions.

Warning

Spoiler alert! The solution to case study 3 is about to be revealed. I strongly encourage you to try to solve the problem before reading the solution. The original problem statement is available for reference at the beginning of the case study.

12.1 Extracting locations from headline data

We begin by loading the headline data.

Listing 12.1 Loading headline data
# Read one headline per line. The with-block closes the file automatically.
with open('headlines.txt', 'r') as headline_file:
    headlines = [line.strip() for line in headline_file]

num_headlines = len(headlines)
print(f"{num_headlines} headlines have been loaded")

650 headlines have been loaded
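
With the headlines loaded, step 2 of the plan calls for matching location names with regular expressions. The following is a minimal sketch of that idea, not necessarily the approach this chapter takes. The `CITY_NAMES` list is a hypothetical stand-in for the names a gazetteer such as GeoNamesCache would supply, the helper names `compile_name_regex` and `find_city` are illustrative, and the sample headlines are invented rather than drawn from headlines.txt.

```python
import re

# Hypothetical stand-in for city names supplied by a gazetteer library.
CITY_NAMES = ['San Francisco', 'San Jose', 'Miami', 'Boston']


def compile_name_regex(names):
    """Compile one case-insensitive regex that matches any name as a whole word."""
    # Try longer names first so a long name wins over a shorter name it contains.
    ordered = sorted(names, key=len, reverse=True)
    pattern = '|'.join(re.escape(name) for name in ordered)
    return re.compile(rf'\b(?:{pattern})\b', flags=re.IGNORECASE)


CITY_REGEX = compile_name_regex(CITY_NAMES)


def find_city(headline):
    """Return the first city name matched in the headline, or None."""
    match = CITY_REGEX.search(headline)
    return match.group(0) if match else None


# Invented headlines, used only to exercise the function.
sample_headlines = ['Zika Outbreak Hits Miami',
                    'Flu season arrives in San Francisco',
                    'Measles cases rise']
for headline in sample_headlines:
    print(f"{headline} -> {find_city(headline)}")
```

The word boundaries (`\b`) keep a name from matching inside a longer word, so "Bostonian" would not be reported as "Boston". Matches still need to be checked for errors, as step 3 notes, because some place names double as ordinary words.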

12.2 Visualizing and clustering the extracted location data
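
Step 4 of the plan clusters locations by geographic distance, which requires a distance measure between latitude and longitude pairs. One common choice is the great-circle distance computed with the haversine formula. The sketch below assumes that choice; the function name `great_circle_distance`, the `(latitude, longitude)` tuple layout in degrees, and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # Assumed mean Earth radius; the Earth is not a perfect sphere.


def great_circle_distance(coord1, coord2, radius=EARTH_RADIUS_KM):
    """Haversine distance between two (latitude, longitude) pairs given in degrees."""
    lat1, lon1 = map(radians, coord1)
    lat2, lon2 = map(radians, coord2)
    delta_lat = lat2 - lat1
    delta_lon = lon2 - lon1
    haversine = (sin(delta_lat / 2) ** 2
                 + cos(lat1) * cos(lat2) * sin(delta_lon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before taking asin.
    return 2 * radius * asin(sqrt(min(1.0, haversine)))


# A quarter of the way around the equator.
print(f"{great_circle_distance((0, 0), (0, 90)):.1f} km")
```

A function like this can serve as a custom metric for a clustering algorithm that accepts one, so that clusters reflect distance over the globe rather than straight-line distance between raw coordinates.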

12.3 Extracting insights from location clusters

Summary