12 Case Study 3 Solution

12.1 Overview

Our goal is to extract locations from disease-related headlines in order to uncover the largest active epidemics within and outside of the United States. We will do so by:

1. Loading the data.
2. Extracting locations from the text using regular expressions and the GeoNamesCache library.
3. Checking the location matches for errors.
4. Clustering the locations based on geographic distance.
5. Visualizing the clusters on a map and removing any errors.
6. Outputting representative locations from the largest clusters in order to draw interesting conclusions.
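As a rough illustration of the extraction step, the sketch below matches headlines against a compiled regular expression built from known location names. It is only a sketch under stated assumptions: the `KNOWN_CITIES` list, the example headlines, and the helper names `compile_name_regex` and `find_city` are all hypothetical. In the actual solution, the location names would come from the GeoNamesCache library rather than a hand-written list.

```python
import re

# Hypothetical stand-in for the city names that GeoNamesCache would supply.
KNOWN_CITIES = ["San Francisco", "San Jose", "Miami", "Boston", "Sao Paulo"]

def compile_name_regex(names):
    """Build one regex that matches any of the given names as whole words."""
    # Longer names go first so that a longer name is preferred over any
    # shorter name that happens to be its prefix.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Word boundaries prevent matches inside longer words (e.g. "Bostonian").
    return re.compile(r"\b(?:" + alternatives + r")\b")

CITY_REGEX = compile_name_regex(KNOWN_CITIES)

def find_city(headline):
    """Return the first known city found in the headline, or None."""
    match = CITY_REGEX.search(headline)
    return match.group(0) if match else None

# Hypothetical headlines, used purely for illustration.
example_headlines = [
    "Zika Outbreak Hits Miami",
    "Could Zika Reach New York City?",  # city is absent from KNOWN_CITIES
    "Flu season begins in San Francisco",
]
for headline in example_headlines:
    print(f"{headline!r} -> {find_city(headline)}")
```

Headlines that return `None` are worth inspecting by hand, since they may mention locations missing from the name list. Matching here is case-sensitive and accent-sensitive, so a more complete version would probably need to normalize both the names and the headlines first.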

Warning

Spoiler alert! The solution to Case Study 3 is about to be revealed. We strongly encourage you to try to solve the problem before reading the solution. The original problem statement is available for reference at the beginning of Part 3.

12.2 Extracting Locations from Headline Data

We’ll begin by loading the headline data.

Listing 12.1. Loading headline data

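# Read every headline, stripping the trailing newline and any extra whitespace.
# The with-statement closes the file automatically once loading is done.
with open('headlines.txt', 'r') as headline_file:
    headlines = [line.strip() for line in headline_file]
num_headlines = len(headlines)
print(f"{num_headlines} headlines have been loaded")

650 headlines have been loaded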
