16 Extracting Text from Web Pages

This section covers:

Rendering web pages with HTML
The basic structure of HTML files
Extracting text from HTML files with the Beautiful Soup library
Downloading HTML files from online sources

The Internet is a great resource for text data. Millions of web pages offer a vast supply of text content in the form of news articles, encyclopedia pages, scientific papers, restaurant reviews, political discussions, patents, corporate financial statements, job postings, and more. All of these pages can be analyzed if we download their HTML files. HTML stands for Hypertext Markup Language. A markup language is a system for annotating documents in which the annotations are kept distinguishable from the document text. In the case of HTML, the annotations are instructions on how to visualize a web page.
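To make the split between annotations and document text concrete, here is a minimal sketch that separates the two in a tiny, made-up HTML string. It uses `html.parser` from Python's standard library rather than Beautiful Soup, and the `MarkupSplitter` class name and the sample document are illustrative assumptions, not part of the original text.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML document. The tags in angle brackets are the
# annotations; everything between them is document text.
html = "<html><body><h1>Hello</h1><p>This is <b>bold</b> text.</p></body></html>"


class MarkupSplitter(HTMLParser):
    """Hypothetical helper: collects tag names and text in separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, such as <h1> or <p>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.text.append(data)


splitter = MarkupSplitter()
splitter.feed(html)
splitter.close()

print(splitter.tags)  # the annotations (opening tags only)
print(splitter.text)  # the document text fragments
```

The tag list holds only instructions for the renderer, while the text list holds what a reader would actually see on the page.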

Web page visualization is usually carried out by a web browser. First, the browser downloads the page’s HTML using its web address, which is called the URL. Next, the browser parses the HTML document for layout instructions. Finally, the browser’s rendering engine formats and displays all images and text according to the markup specifications. The rendered page can then be read easily by a human being.
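The three browser steps (download, parse, render) can be loosely imitated in a few lines of Python. The sketch below is only an approximation under stated assumptions: it relies on the standard library instead of a real rendering engine, its "rendering" is plain text with headings in upper case, and the names `download_html`, `TextRenderer`, and `render_as_text` are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def download_html(url):
    """Step 1: fetch the page's HTML from its URL (needs network access)."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextRenderer(HTMLParser):
    """Steps 2 and 3: parse the markup and build a crude text rendering."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Headings are shown in upper case; other text is left unchanged.
            self.lines.append(text.upper() if self._in_heading else text)


def render_as_text(html):
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    return "\n".join(renderer.lines)


# Offline usage with an inline, made-up document. With network access,
# render_as_text(download_html(some_url)) would cover all three steps.
page = "<html><body><h1>Welcome</h1><p>First paragraph.</p></body></html>"
print(render_as_text(page))
```

A real browser does far more at each step (styles, images, scripts), but the division of labor is the same: the markup tells the renderer how to present the text, and only the text reaches the reader.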

