Our goal is to extract locations from disease-related headlines to uncover the largest active epidemics inside and outside the United States. We will proceed as follows:
- Load the data.
- Extract locations from the text using regular expressions and the GeoNamesCache library.
- Check the location matches for errors.
- Cluster the locations based on geographic distance.
- Visualize the clusters on a map, and remove any errors.
- Output representative locations from the largest clusters to draw interesting conclusions.
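
As a rough illustration of the extraction and distance steps, here is a minimal sketch using only the Python standard library. The `CITIES` table is a hypothetical stand-in for the city records that GeoNamesCache would supply, with approximate coordinates, and the helper names `extract_city` and `great_circle_km` are assumptions made for this example rather than part of any library.

```python
import re
from math import asin, cos, radians, sin, sqrt

# Hypothetical stand-in for GeoNamesCache city records: name -> (lat, lon).
# Coordinates are approximate and for illustration only.
CITIES = {
    "Miami": (25.77, -80.19),
    "Miami Beach": (25.79, -80.13),
    "San Juan": (18.47, -66.11),
    "Bangkok": (13.75, 100.50),
}

# Put longer names first so that "Miami Beach" is preferred over "Miami".
_names = sorted(CITIES, key=len, reverse=True)
CITY_REGEX = re.compile(r"\b(" + "|".join(re.escape(n) for n in _names) + r")\b")


def extract_city(headline):
    """Return the first known city name found in the headline, or None."""
    match = CITY_REGEX.search(headline)
    return match.group(1) if match else None


def great_circle_km(a, b, radius_km=6371.0):
    """Haversine distance in kilometers between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * asin(sqrt(h))


headlines = ["Zika Outbreak Hits Miami Beach", "Flu season begins"]
for headline in headlines:
    print(headline, "->", extract_city(headline))
```

A distance function like `great_circle_km` is the kind of metric a clustering algorithm could use to group nearby headline locations; which clustering method fits best depends on the data.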