Preface

While studying information retrieval and search engines at the University of Southern California in the summer of 2005, I became interested in the Apache Nutch project. My professor, Dr. Ellis Horowitz, had recently discovered Nutch and thought it a good platform for giving students real-world experience during the final project phase of his “CS599: Seminar on Search Engines” course.

After poking around Nutch and digging into its innards, I decided on a final project. It was a Really Simple Syndication (RSS) plugin described in detail in NUTCH-30.¹ The plugin read an RSS file, extracted its outgoing web links and text, and fed that information back into the Nutch crawler for later indexing and retrieval.

¹ https://issues.apache.org/jira/browse/NUTCH-30

Though seemingly innocuous, the class taught me a great deal about search engines and helped pinpoint the area of search I was interested in: content detection and extraction.