Appendix B. Web crawling


This appendix provides an overview of web crawler components, a brief description of the implementation details of the crawler provided with the book, and a look at a few open-source crawlers written in Java.

B.1. An overview of crawler components

Web crawlers are used to discover, download, and store content from the Web. As we’ve seen in chapter 2, a web crawler is just a part of a larger application such as a search engine.

A typical web crawler has the following components (a simplified sketch showing how they fit together follows the list):

  • A URL repository module that keeps track of all URLs known to the crawler.
  • A document download module that retrieves documents from the Web using a provided set of URLs.
  • A document parsing module that's responsible for extracting the raw content from a variety of document formats, such as HTML, PDF, and Microsoft Word. The parsers are also responsible for extracting the URLs contained in a document, as well as other data that can be useful during the indexing phase, in particular metadata.
  • A document repository module that stores the metadata and content extracted from the raw documents during the crawling process.
  • A URL normalization module that transforms URLs into a standard form so that they can be compared, evaluated, and so on.
  • A URL filtering module that lets the crawler skip undesirable URLs.
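To make the division of labor among these modules concrete, the following sketch shows one way to wire them together in a single-threaded crawl loop using only the standard Java HTTP client. It is an illustration, not the crawler that accompanies the book: the MiniCrawler class, its method names, and the regex-based link extraction are simplifications invented for this example, and a real crawler would use a proper document parser and a persistent repository.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniCrawler {

    // URL repository: URLs waiting to be fetched and URLs already processed.
    private final Deque<URI> frontier = new ArrayDeque<>();
    private final Set<URI> seen = new HashSet<>();

    // Crude link extraction; a real parsing module would use an HTML/PDF parser.
    private static final Pattern HREF =
            Pattern.compile("href=[\"'](https?://[^\"']+)[\"']");

    private final HttpClient client = HttpClient.newHttpClient();

    public void crawl(URI seed, int maxPages) throws IOException, InterruptedException {
        frontier.add(normalize(seed));
        int fetched = 0;
        while (!frontier.isEmpty() && fetched < maxPages) {
            URI url = frontier.poll();
            if (!seen.add(url) || !accept(url)) {
                continue; // URL filtering: skip duplicates and undesirable URLs
            }

            // Document download module: retrieve the document over HTTP.
            HttpRequest request = HttpRequest.newBuilder(url).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            fetched++;

            // Document repository module: store content and metadata (stubbed here).
            store(url, response.body());

            // Document parsing module: extract outgoing links and push them on the frontier.
            Matcher m = HREF.matcher(response.body());
            while (m.find()) {
                try {
                    frontier.add(normalize(URI.create(m.group(1))));
                } catch (IllegalArgumentException e) {
                    // Ignore links the crude regex mangled.
                }
            }
        }
    }

    // URL normalization module: here only path normalization; a real crawler would also
    // lower-case the host, strip fragments, resolve relative links, and so on.
    private URI normalize(URI url) {
        return url.normalize();
    }

    // URL filtering module: restrict the crawl to HTTP and HTTPS resources.
    private boolean accept(URI url) {
        String scheme = url.getScheme();
        return "http".equals(scheme) || "https".equals(scheme);
    }

    private void store(URI url, String content) {
        System.out.println("Stored " + content.length() + " characters from " + url);
    }
}

Calling new MiniCrawler().crawl(URI.create("https://example.com/"), 10) would fetch up to ten pages reachable from the seed URL, exercising each of the modules listed above in turn.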

B.2. References