12 Case study 3 solution

This section covers

Extracting and visualizing locations
Cleaning data
Clustering locations

Our goal is to extract locations from disease-related headlines to uncover the largest active epidemics both inside and outside the United States. We will proceed as follows:

Load the data.
Extract locations from the text using regular expressions and the GeoNamesCache library.
Check the location matches for errors.
Cluster the locations based on geographic distance.
Visualize the clusters on a map, and remove any errors.
Output representative locations from the largest clusters to draw interesting conclusions.
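Before working through the solution, it may help to see the location-matching step in miniature. The sketch below is only an illustration under stated assumptions: the short city_names list is a hypothetical stand-in for the names that GeoNamesCache would supply, the example headlines are invented, and find_city is a helper name introduced here rather than part of any library. It compiles the names into one regular expression, placing longer names first so that a multiword city such as 'San Diego' is matched in preference to a shorter name like 'San'.

```python
import re

# Hypothetical stand-in for the city names GeoNamesCache would provide.
city_names = ['San Diego', 'San', 'New York City', 'York', 'Miami']

# Sort longer names first so the alternation prefers 'San Diego' over 'San'.
sorted_names = sorted(city_names, key=len, reverse=True)

# Escape each name and require word boundaries on both sides of a match.
pattern = r'\b(' + '|'.join(re.escape(name) for name in sorted_names) + r')\b'
city_regex = re.compile(pattern)


def find_city(headline):
    """Return the first city name found in the headline, or None."""
    match = city_regex.search(headline)
    return match.group(0) if match else None


# Invented headlines, used purely for illustration.
example_headlines = [
    'Zika Outbreak Hits Miami',
    'Flu season arrives in San Diego',
    'Could Zika Reach New York City?',
    'Officials issue new vaccination guidance',
]

for headline in example_headlines:
    print(f"{headline!r} -> {find_city(headline)}")
```

A headline with no recognized name yields None, which is one reason the matches still need to be checked for errors afterward.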

Warning

Spoiler alert! The solution to case study 3 is about to be revealed. I strongly encourage you to try to solve the problem before reading the solution. The original problem statement is available for reference at the beginning of the case study.

12.1 Extracting locations from headline data

We begin by loading the headline data.

Listing 12.1 Loading headline data

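# Read every headline, stripping surrounding whitespace.
# The with-statement closes the file once reading is done.
with open('headlines.txt', 'r') as headline_file:
    headlines = [line.strip() for line in headline_file]
num_headlines = len(headlines)
print(f"{num_headlines} headlines have been loaded")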

650 headlines have been loaded
