Chapter 6. Intelligent web crawling

This chapter covers

A brief overview of web crawling and intelligent crawling
A step-by-step implementation of a web crawler
Crawling with Nutch
Scalable web crawling

No one knows the exact number of web pages on the Internet. But we do know that the World Wide Web is

Huge, with billions of web pages
Dynamic, with pages being constantly added, removed, or updated
Growing rapidly

Given the huge amount of information available on the Internet, how does one find information of interest?

In this chapter, we continue our theme of gathering information from outside one’s application. You’ll be introduced to the field of intelligent web crawling, which focuses on retrieving only relevant information. Search engines such as Google and Yahoo! constantly crawl the web to index available content and gather data for their search results. You may likewise want to crawl the web to harvest information from external sites for use in your own application.

This chapter covers

6.1. Introducing web crawling

6.2. Building an intelligent crawler step by step

6.3. Scalable crawling with Nutch

6.4. Summary

6.5. Resources