Part 3. Case Study 3: Tracking Disease Outbreaks Using News Headlines

CS3.1  Problem Statement

Congratulations! You have just been hired by the American Institute of Health. The Institute monitors disease epidemics in both foreign and domestic lands. A critical component of the monitoring process involves analysis of published news data. Each day, the Institute receives hundreds of news headlines describing disease outbreaks in various locations. The news headlines are too numerous to be analyzed by hand.

Your first assignment is as follows: you will process the daily quota of news headlines and extract the locations mentioned within them. You will then cluster the headlines based on their geographic distribution. Finally, you will review the largest clusters both within and outside the United States. Any interesting findings should be reported to your immediate superior.

CS3.1.1  Dataset Description

The file 'headlines.txt' contains the hundreds of headlines that you must analyze. Each headline appears on its own line within the file.
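Given the one-headline-per-line layout described above, loading the data might look like the following minimal sketch. The function name load_headlines is hypothetical, and the UTF-8 encoding and the skipping of blank lines are assumptions; the dataset description does not specify either.

```python
from pathlib import Path


def load_headlines(path="headlines.txt"):
    """Return the headlines stored in the file, one per non-empty line."""
    # Assumption: the file is UTF-8 encoded (not stated in the description).
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip any blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling load_headlines() from the directory that holds 'headlines.txt' would return a list of strings, and len() of that list gives the number of headlines to process.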

CS3.2  Overview

To address the problem at hand, we will need to know how to:

  1. Cluster datasets using multiple techniques and distance measures.
  2. Measure distances between locations on a spherical globe.
  3. Visualize locations on a map.
  4. Extract location coordinates from headline text.
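As a preview of item 2, the sketch below computes the great-circle distance between two latitude/longitude points using the haversine formula. This is one common choice for distances on a sphere, not necessarily the approach the case study adopts later. The function name great_circle_distance and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

# Assumption: mean Earth radius in kilometers; the Earth is treated as a sphere.
EARTH_RADIUS_KM = 6371.0


def great_circle_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Haversine distance between two points given in decimal degrees."""
    # Convert all coordinates from degrees to radians.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term; clamp to 1.0 to guard against floating-point overshoot.
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * radius * asin(sqrt(min(1.0, a)))
```

Under this radius, a quarter of the way around the equator, great_circle_distance(0, 0, 0, 90), comes to roughly 10,008 km. A function of this shape can serve as a custom distance measure when clustering headline locations (item 1).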