16 Extracting Text from Web Pages

 

This section covers:

  • Rendering web pages with HTML
  • The basic structure of HTML files
  • Extracting text from HTML files with the Beautiful Soup library
  • Downloading HTML files from online sources

The Internet is a great resource for text data. Millions of web pages offer a vast supply of text content in the form of news articles, encyclopedia pages, scientific papers, restaurant reviews, political discussions, patents, corporate financial statements, job postings, and more. All of these pages can be analyzed once we download their HTML files. HTML stands for Hypertext Markup Language. A markup language is a system for annotating documents in a way that keeps the annotations distinct from the document text. In the case of HTML, these annotations are instructions on how to visualize a web page.
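To make the distinction between annotations and document text concrete, here is a minimal sketch that separates the two in a tiny, made-up HTML string. It relies only on Python's standard-library `html.parser` module rather than Beautiful Soup, and the class name `MarkupSplitter` and the sample document are assumptions introduced purely for illustration.

```python
from html.parser import HTMLParser

# A tiny hypothetical HTML document (not taken from any real page).
html = "<html><body><h1>Data Science</h1><p>Text is <b>everywhere</b>.</p></body></html>"


class MarkupSplitter(HTMLParser):
    """Collects tag names (the annotations) separately from the document text."""

    def __init__(self):
        super().__init__()
        self.tags = []  # names of opening tags, in document order
        self.text = []  # text fragments found between tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


splitter = MarkupSplitter()
splitter.feed(html)
splitter.close()

print(splitter.tags)           # the markup annotations
print("".join(splitter.text))  # the document text with annotations removed
```

The tags describe how the page should look, while the joined text fragments are the content a text-analysis pipeline would typically keep.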

Web page visualization is usually carried out using a web browser. First, the browser downloads the page’s HTML based on its web address, which is called the URL. Next, the browser parses the HTML document for layout instructions. Finally, the browser’s rendering engine formats and displays all images and text, per the markup specifications. Afterwards, the rendered page can easily be read by a human being.
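The three browser steps described above (download, parse, render) can be imitated in a very rough way with the standard library. The sketch below is only an illustration under simplifying assumptions: the helper names `download_html`, `TextRenderer`, and `render` are hypothetical, the "rendering" merely uppercases top-level headings and prints paragraphs as plain lines, and text inside nested tags is ignored. A real rendering engine does far more.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def download_html(url):
    """Step 1: fetch the page's HTML from its URL."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextRenderer(HTMLParser):
    """Steps 2 and 3: parse the markup and lay out the text very crudely."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.current_tag = None  # the most recently opened, still-open tag

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.current_tag == "h1":
            self.lines.append(data.upper())  # headings shown in capitals
        elif self.current_tag == "p":
            self.lines.append(data)          # paragraphs shown as-is


def render(html):
    """Return a plain-text rendering of the given HTML string."""
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    return "\n".join(renderer.lines)


# An inline, made-up page so the example runs without network access.
page = "<html><body><h1>Hello</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
print(render(page))
```

With network access, the same `render` function could be applied to a downloaded page, for example `render(download_html("https://example.com"))`, where the URL is only a placeholder.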

16.1  The Structure of HTML Documents

 
 
 

16.2  Parsing HTML using Beautiful Soup

 
 
 

16.3  Downloading and Parsing Online Data

 
 
 
 

16.4  Summary

 
 