chapter sixteen

16 Extracting text from web pages

This section covers

Rendering web pages with HTML
The basic structure of HTML files
Extracting text from HTML files with the Beautiful Soup library
Downloading HTML files from online sources

The internet is a great resource for text data. Millions of web pages offer a vast supply of text in the form of news articles, encyclopedia pages, scientific papers, restaurant reviews, political discussions, patents, corporate financial statements, job postings, and more. All these pages can be analyzed if we download their Hypertext Markup Language (HTML) files. A markup language is a system for annotating documents that distinguishes the annotations from the document text. In the case of HTML, these annotations are instructions on how to visualize a web page.
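To make the split between annotations and document text concrete, here is a minimal sketch that separates the two in a tiny HTML string. It uses the `html.parser` module from Python's standard library as a stand-in, not the Beautiful Soup library covered later in this section. The sample page and the `MarkupSplitter` class name are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page, made up for illustration.
SAMPLE_HTML = (
    "<html><body><h1>Data Science</h1>"
    "<p>Text is <b>everywhere</b>.</p></body></html>"
)


class MarkupSplitter(HTMLParser):
    """Collects tag names and document text in two separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []  # annotations: the names of the opening tags
        self.text = []  # document text found between the tags

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, such as <p> or <b>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; whitespace-only runs are skipped.
        if data.strip():
            self.text.append(data.strip())


splitter = MarkupSplitter()
splitter.feed(SAMPLE_HTML)
splitter.close()

print(splitter.tags)  # the annotations
print(splitter.text)  # the document text
```

The first printed list holds only tag names (`html`, `body`, `h1`, `p`, `b`), while the second holds only the words a reader would see. This separation is what lets us pull text out of a page while discarding its layout instructions.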

Web page visualization is usually carried out using a web browser. First, the browser downloads the page’s HTML based on its web address, the URL. Next, the browser parses the HTML document for layout instructions. Finally, the browser’s rendering engine formats and displays all images and text per the markup specifications. The rendered page can easily be read by a human being.
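The browser's first two steps, downloading and parsing, can be roughly imitated in a few lines of Python. The sketch below is an assumption-laden illustration rather than how any real browser works: it relies only on the standard library (`urllib.request` and `html.parser`, not Beautiful Soup), the names `fetch_html`, `TitleAndTextParser`, and `page_summary` are hypothetical, and the final "rendering" step is reduced to printing plain text. So that the example runs without a network connection, the demo page is embedded in a `data:` URL, and any `http` or `https` address could be passed instead.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url):
    """Step 1: download the page's HTML from its URL."""
    with urlopen(url) as response:
        # Fall back to UTF-8 if the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleAndTextParser(HTMLParser):
    """Step 2: parse the HTML, keeping the title and the visible text."""

    SKIPPED = {"script", "style"}  # content that is never shown as text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        data = data.strip()
        if not data or self._skip_depth:
            return
        if self._in_title:
            self.title = data
        else:
            self.text.append(data)


def page_summary(url):
    """Download a page, parse it, and return (title, visible text)."""
    parser = TitleAndTextParser()
    parser.feed(fetch_html(url))
    parser.close()
    return parser.title, " ".join(parser.text)


# Made-up demo page, embedded in a data: URL so no network is needed.
page = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>Plain text.</p></body></html>"
)
demo_url = "data:text/html;charset=utf-8," + quote(page)

# Step 3 (crude stand-in for rendering): print the title and the text.
title, text = page_summary(demo_url)
print(title)
print(text)
```

A real rendering engine does far more, including layout, styling, and image loading. For text analysis, though, the downloaded HTML and its parsed text are usually all we need.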

16.1 The structure of HTML documents

16.2 Parsing HTML using Beautiful Soup

16.3 Downloading and parsing online data

Summary