Our goal is to extract locations from disease-related headlines in order to uncover the largest active epidemics within and outside of the United States. We will do so by:
- Loading the data.
- Extracting locations from the text using regular expressions and the GeoNamesCache library.
- Checking the location-matches for errors.
- Clustering the locations based on geographic distance.
- Visualizing the clusters on a map and removing any errors.
- Outputting representative locations from the largest clusters in order to draw interesting conclusions.
Warning
Spoiler alert! The solution to Case Study 3 is about to be revealed. We strongly encourage you to try to solve the problem before reading the solution. The original problem statement is available for reference at the beginning of Part 3.
We’ll begin by loading the headline data.
Listing 12.1. Loading headline data
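# Read one headline per line, stripping surrounding whitespace.
# The with-block closes the file once reading is done.
with open('headlines.txt', 'r') as headline_file:
    headlines = [line.strip() for line in headline_file]

num_headlines = len(headlines)
print(f"{num_headlines} headlines have been loaded")

650 headlines have been loaded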
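Before moving on, it can be worth confirming that the loaded list looks as expected. The sketch below is one possible sanity check and is not part of the original solution: the helper name `summarize_headlines` and the sample headlines are hypothetical, chosen only for illustration. It counts empty lines and headlines that appear more than once.

```python
from collections import Counter

def summarize_headlines(headlines):
    """Return basic quality counts for a list of headline strings.

    Hypothetical helper for illustration; not part of the case study.
    """
    stripped = [headline.strip() for headline in headlines]
    counts = Counter(stripped)
    return {
        'total': len(stripped),
        # Lines that are blank after stripping whitespace.
        'empty': sum(1 for headline in stripped if not headline),
        # Distinct non-empty headlines that occur more than once.
        'duplicated': sum(1 for headline, n in counts.items()
                          if headline and n > 1),
    }

# Made-up sample data, standing in for the contents of headlines.txt.
sample = [
    'Zika Outbreak Hits Miami',
    'Could Zika Reach New York City?',
    '',
    'Zika Outbreak Hits Miami',
]
summary = summarize_headlines(sample)
print(summary)
```

On the real data, the same check could be run by passing in the `headlines` list loaded in Listing 12.1. A nonzero `empty` or `duplicated` count would suggest cleaning the list before extracting locations.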