3 Automating the web with scraping
This chapter covers
- Creating structured data from web pages
- Basic web scraping with cheerio
- Handling dynamic content with JSDOM
- Parsing and outputting structured data
In the last chapter you learned some general Node programming techniques; now we’re going to start focusing on web development. Scraping the web is an ideal way to do this, because it requires a combination of server-side and client-side programming skills. Scraping is all about using programming techniques to make sense of web pages and transform them into structured data. Imagine you’re tasked with creating a new version of a book publisher’s website that is currently just a set of old-fashioned, static HTML pages. You want to download the pages and analyze them to extract the titles, descriptions, authors, and prices for all the books. You don’t want to do this by hand, so you write a Node program to do it. This is web scraping.
Node is great at scraping because it strikes a perfect balance between browser-based technology and the power of general purpose scripting languages. In this chapter you’ll learn how to use HTML parsing libraries to extract useful data based on CSS selectors, and even to run dynamic web pages in a Node process.

3.1 What is web scraping?
The term web scraping refers to the process of extracting useful information from websites. This usually involves downloading the required pages, parsing them, and then querying the raw HTML using CSS or XPath selectors. The results of the queries are then exported as CSV or saved to a database. Figure 3.1 shows how scraping works from start to finish.
Figure 3.1 How web scraping works, from downloading pages to exporting structured data

Web scraping may be against the terms of use of some websites, often because of costs or resource limitations. If thousands of scrapers hit a single site running on an old, slow server, they could effectively knock the server offline. Before you scrape any content, you should ensure you have permission to access and duplicate it. You can check the site’s robots.txt (http://www.robotstxt.org/) file for technical restrictions, but you should also contact the site’s owners first. In some cases the owners may have invited you to index their information, perhaps as part of a larger web development contract.
In this section you’ll learn about how people use scrapers for real sites, and then we’ll look at the required tools that allow Node to become a web scraping powerhouse.
3.1.1 Uses of web scraping
A great example of web scraping is the vertical search engine Octopart (https://octopart.com/). Octopart indexes electronics distributors and manufacturers to make it easier for people to find electronics. For example, you can search for resistors based on resistance, tolerance, power rating, and case type. A site like this uses web crawlers to download content, scrapers to make sense of the content and extract interesting values (like the tolerance of a resistor), and then an internal database to store the processed information.
Figure 3.2 Octopart

Web scraping isn’t just used for search engines, however. It’s also used in the growing fields of data science and data journalism. Data journalists use databases to produce stories, but because there’s so much data that isn’t stored in easily accessible formats, they may use tools like web scraping to automate the collection and processing of data. This allows journalists to present information in new ways, through data visualization techniques including infographics and interactive graphics.
3.1.2 Required tools
To get down to business you’ll need two easily accessible tools: a web browser and Node. Browsers are one of the most useful scraping tools: if you can right-click and select “Inspect element”, then you’re already partway to making sense of websites and converting them into raw data. The next step is to parse the pages with Node. In this chapter you’re going to learn about two types of parser:
- Lightweight and forgiving: cheerio
- A web standards aware, Document Object Model (DOM) simulator: jsdom
Both of these libraries are installed with npm. You may also need to parse loosely structured, human-readable data such as dates. We’ll briefly look at JavaScript’s Date.parse and Moment.js.
The first example uses cheerio, which is a fast way to parse most static web pages.

3.2 Basic web scraping with cheerio
The cheerio library (https://www.npmjs.com/package/cheerio), by Felix Böhm, is perfect for scraping because it combines two key features: fast HTML parsing, and a jQuery-like API for querying and manipulating the HTML.
Imagine you need to extract information about books from a publisher’s website. The publisher doesn’t yet have an API that exposes book details, so you need to download pages from its website and turn them into usable JSON output that includes the author name and book title. Figure 3.3 shows how scraping with cheerio works.
Figure 3.3 Scraping with cheerio

Listing 3.1 contains a small scraper that uses cheerio. Sample HTML has been included so you don’t need to worry about how to download the page itself yet.
Listing 3.1 Extracting a book’s details
const html = ` #1
<html>
  <body>
    <div class="book">
      <h2>Catch-22</h2>
      <h3>Joseph Heller</h3>
      <p>A satirical indictment of military madness.</p>
    </div>
  </body>
</html>`;

const cheerio = require('cheerio');
const $ = cheerio.load(html); #2

const book = {
  title: $('.book h2').text(), #3
  author: $('.book h3').text(),
  description: $('.book p').text()
};

console.log(book);
#1 Define HTML to parse
#2 Parse the entire document
#3 Extract the fields using CSS selectors
Listing 3.1 uses cheerio to parse a hardcoded HTML document using the cheerio.load() method and CSS selectors. In an example like this the selectors are short and clear, but real-world HTML is often far messier. Unfortunately, poorly structured HTML is unavoidable, and your skill as a web scraper is defined by coming up with clever ways to pull out the values you need.
There are two steps to making sense of bad HTML. The first is to visualize the document and work out selectors that target the elements you’re interested in; the second is to use cheerio’s features to apply those selectors in just the right way.
Fortunately, modern browsers offer a point-and-click solution for finding selectors: if your browser has development tools, then you can usually right-click and select “Inspect element”. Not only will you see the underlying HTML, but the browser should also show a representation of the selector that targets the element.
Let’s say you’re trying to extract book information from a quirky site that uses tables without any handy CSS classes. The HTML might look like this:
<html>
  <body>
    <h1>Alex's Dated Book Website</h1>
    <table>
      <tr>
        <td><a href="/book1">Catch-22</a></td>
        <td>Joseph Heller</td>
      </tr>
    </table>
  </body>
</html>
If you open that in Chrome and right-click the title, you’ll see something like figure 3.4.
Figure 3.4 Viewing HTML in Chrome

The white bar under the HTML shows “html body table tbody tr td a”. This is close to the selector we need, but it’s not quite right, because the real HTML doesn’t have a tbody; Chrome has inserted that element. When you use a browser to visualize documents, be prepared to adjust what you discover based on the true underlying HTML. Figure 3.4 shows that we need to search for a link inside a table cell to get the title, and that the next table cell contains the corresponding author.
Assuming the above HTML is in a file called messy_html_example.html, listing 3.2 will extract the title, link, and author.
Listing 3.2 Dealing with messy HTML
const fs = require('fs');
const html = fs.readFileSync('./messy_html_example.html', 'utf8'); #1
const cheerio = require('cheerio');
const $ = cheerio.load(html);

const book = {
  title: $('table tr td a').first().text(), #2
  href: $('table tr td a').first().attr('href'), #3
  author: $('table tr td').eq(1).text() #4
};

console.log(book);
#1 Load the HTML from a file
#2 Use cheerio’s first() method to get the specific link
#3 Use cheerio’s attr() method to get the URL
#4 Use cheerio’s eq() method to skip to the second element
Listing 3.2 uses the fs module to load the HTML; that’s just so we don’t have to keep printing HTML in the examples. In reality your data source might be a live website, but it could also be a file or a database. Once the document has been parsed, first() is used to get the first table cell with an anchor. To get the anchor’s URL, cheerio’s attr() method is used: it returns a specific attribute from an element, just like jQuery. The eq() method is also useful: in this listing it’s used to skip the first td, because the second td contains the author’s text.
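If your source really is a live website, you can fetch the page before handing it to cheerio. The following is a minimal sketch, assuming a hypothetical URL and the same table markup as above; it also uses cheerio’s .map() and .get() methods to collect every row rather than just the first:

const http = require('http');
const cheerio = require('cheerio');

// The URL is a placeholder; swap in a page you have permission to scrape
http.get('http://example.com/books.html', (res) => {
  let html = '';
  res.on('data', (chunk) => html += chunk);
  res.on('end', () => {
    const $ = cheerio.load(html);
    // Map every table row to a book object, not just the first one
    const books = $('table tr').map((i, el) => {
      return {
        title: $(el).find('td a').text(),
        href: $(el).find('td a').attr('href'),
        author: $(el).find('td').eq(1).text()
      };
    }).get();
    console.log(books);
  });
}).on('error', (err) => console.error(err));

Node’s built-in http module is enough here; for pages served over HTTPS you’d use the https module instead.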
One of cheerio’s limitations is that it only allows you to work with a static version of a document: it’s used for working with pure HTML documents rather than dynamic pages that use client-side JavaScript. In the next section you’ll learn how to use JSDOM to create a browser-like environment in your Node applications, so client-side JavaScript will be executed.

3.3 Handling dynamic content with JSDOM
JSDOM is the web scraper’s dream tool: it downloads HTML, interprets it according to the DOM as found in a typical browser, and runs client-side JavaScript. You can specify the client-side JavaScript that you want to run, which typically means including jQuery, so you can inject jQuery (or your own custom debugging scripts) into any page. Figure 3.5 shows how JSDOM combines HTML and JavaScript to make otherwise unscrapable content accessible.
Figure 3.5 Scraping with JSDOM

JSDOM does have some downsides, however. It’s not a perfect simulation of a browser, it’s slower than cheerio, and its HTML parser is strict, so it may fail on pages with poorly written markup. Some sites just don’t make sense without client-side JavaScript support, though, and for those scraping tasks JSDOM is indispensable.
The basic usage of JSDOM is through the jsdom.env method. Listing 3.3 shows how JSDOM can be used to scrape a page by injecting jQuery and pulling out some useful values.
Listing 3.3 Scraping with JSDOM
const jsdom = require('jsdom');
const html = ` #1
<div class="book">
  <h2>Catch-22</h2>
  <h3>Joseph Heller</h3>
  <p>A satirical indictment of military madness.</p>
</div>
`;

jsdom.env(html, ['./node_modules/jquery/dist/jquery.js'], scrape); #2

function scrape(err, window) {
  var $ = window.$; #3
  $('.book').each(function() { #4
    var $el = $(this);
    console.log({
      title: $el.find('h2').text(), #5
      author: $el.find('h3').text(),
      description: $el.find('p').text()
    });
  });
}
#1 Include a suitable HTML fragment
#2 Parse the document and load jQuery
#3 Alias the jQuery object for convenience
#4 Iterate over the books using jQuery’s $.each method
#5 Use jQuery’s traversal methods to get the values of the book
To run listing 3.3, you’ll need to save jQuery locally and install jsdom.[2] You can install both with npm; the modules are called jsdom (https://www.npmjs.com/package/jsdom) and jquery (https://www.npmjs.com/package/jquery) respectively. Once everything is set up, this code should print the title, author, and description from the HTML fragment.
The jsdom.env method is used to parse the document and inject jQuery. Here jQuery comes from the copy installed with npm, but you could supply a URL to jQuery on a CDN or elsewhere on your filesystem; JSDOM will know what to do. The jsdom.env method is asynchronous and requires a callback. The callback receives error and window objects; the window object is how you access the document. Here the window’s jQuery object has been aliased so it can easily be accessed as $.
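For instance, here’s a minimal sketch of the CDN approach; the jQuery URL is just an example, and JSDOM fetches it over the network when the script runs:

const jsdom = require('jsdom');
const html = '<div class="book"><h2>Catch-22</h2></div>';

// A remote script URL works in place of a local path
jsdom.env(html, ['https://code.jquery.com/jquery-2.2.4.min.js'], (err, window) => {
  if (err) throw err;
  console.log(window.$('.book h2').text()); // Catch-22
});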
A selector is used with jQuery’s .each method to iterate over each book. This example has only one book, but it demonstrates that jQuery’s iteration and traversal methods really are available: each value from the book is extracted with find(), just as it would be in a browser.
Listing 3.3 is similar to the earlier cheerio example in listing 3.1, but the main difference is that jQuery has been parsed and run by Node, within the current process. Listing 3.1 used cheerio to provide similar functionality, but cheerio provides its own jQuery-like layer. Here you’re running code intended for a browser as if it’s really running in a browser.
The jsdom.env method is only really useful for working with static pages. To parse pages that use client-side JavaScript, you’ll need to use jsdom.jsdom instead. This is a synchronous method that returns a document object; its defaultView property gives you a window object that you can manipulate with other jsdom utilities. Listing 3.4 uses jsdom to parse a document with a script tag, and jsdom.jQueryify to make scraping it easier.
Listing 3.4 Parsing dynamic HTML with jsdom
const jsdom = require('jsdom');
const jqueryPath = './node_modules/jquery/dist/jquery.js'; #1
const html = `
<div class="book">
  <h2></h2> #2
  <h3></h3>
  <script>
    document.querySelector('h2').innerHTML = 'Catch-22'; #3
    document.querySelector('h3').innerHTML = 'Joseph Heller';
  </script>
</div>
`;

const doc = jsdom.jsdom(html); #4
const window = doc.defaultView;

jsdom.jQueryify(window, jqueryPath, function() { #5
  var $ = window.$;
  $('.book').each(function() {
    var $el = $(this);
    console.log({
      title: $el.find('h2').text(), #6
      author: $el.find('h3').text()
    });
  });
});
#1 Specify the jQuery path
#2 HTML with no static values
#3 A script that dynamically inserts the values
#4 Create an object that represents the document
#5 Insert jQuery into the document
#6 Extract the book values
Listing 3.4 requires jQuery to be installed, so if you’re creating this listing by hand you’ll need to set up a new project with npm init and npm install --save jquery jsdom. It uses a very simple HTML document in which the values we’re looking for are inserted dynamically by client-side JavaScript in a script tag.
This time, jsdom.jsdom is used instead of jsdom.env. It’s synchronous because the document object is created in memory, but it won’t do much until we attempt to query or manipulate it. To do that, jsdom.jQueryify is used to insert our specific version of jQuery into the document. Once jQuery has been loaded and run, the callback queries the document for the values we’re interested in and prints them to the console. The output will be:
{ title: 'Catch-22', author: 'Joseph Heller' }
This proves that jsdom has invoked the necessary client-side JavaScript. Now imagine this is a real web page and you’ll see why jsdom is so powerful: even websites made with very little static HTML and dynamic technologies like Angular and React can be scraped.

3.4 Making sense of raw data
When you finally get useful data from a page, you’ll need to process it so it’s suitable for saving to a database, or an export format like CSV. Your scraped data will either be unstructured plain text or encoded using microformats.
Microformats are lightweight markup-based data formats that are used for things like addresses, calendars and events, and tags or keywords. You can find established microformats on microformats.org. Here’s an example of a name represented as a microformat:
<a class="h-card" href="http://example.com">Joseph Heller</a>
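Here’s a minimal sketch of pulling that name out with cheerio (any of the parsers in this chapter would do):

const cheerio = require('cheerio');
// Load the microformat fragment and select it by its class
const $ = cheerio.load(
  '<a class="h-card" href="http://example.com">Joseph Heller</a>'
);
console.log($('.h-card').text()); // Joseph Heller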
As the sketch shows, microformats are relatively easy to parse: with cheerio or jsdom, a simple expression like $('.h-card').text() is enough to extract “Joseph Heller”. Plain text, however, requires more work. In this section you’ll see how to parse dates and then convert them into more database-friendly formats.
3.4.1 Extracting unstructured values
Most web pages don’t use microformats. One area where this is problematic but potentially manageable is date values. Dates can appear in many formats, but they’re usually consistent on a given website. Once you’ve identified the format you can parse and then format the date.
JavaScript has a built-in date parser: if you run new Date('2016 01 01'), an instance of Date will be returned that corresponds to the first of January, 2016. The supported input formats are determined by Date.parse, which is based on RFC 2822 (http://tools.ietf.org/html/rfc2822#page-14) and ISO 8601 (http://www.w3.org/TR/NOTE-datetime). Other formats may work too, so it’s often worth trying the parser out with your source data to see what happens.
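One way to try it out is to parse a sample value and check whether the result is valid; an unrecognized format produces an invalid date. A minimal sketch, using a date format that appears later in this chapter:

const parsed = new Date('11 November 1961'); // a sample value from your source data
if (Number.isNaN(parsed.getTime())) {
  console.log('Format not recognized; fall back to a regular expression');
} else {
  console.log(parsed); // a valid Date instance you can reformat
}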
The other approach is to match values in the source data with regular expressions, and then use Date’s constructor to make new Date objects. The signature for the constructor is:
new Date(year, month[, day[, hour[, minutes[, seconds[, millis]]]]]);
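Here’s a minimal sketch of that approach, assuming the source dates look like 11/11/1961 in day/month/year order (an assumption for illustration, not a format from the chapter). Note that the constructor’s month argument is zero-based:

// Capture day, month, and year from a slash-separated date string
const match = '11/11/1961'.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
if (match) {
  const day = Number(match[1]);
  const month = Number(match[2]);
  const year = Number(match[3]);
  console.log(new Date(year, month - 1, day)); // months are zero-based
}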
Date parsing in JavaScript is usually good enough to handle many cases, but where it falls down is reformatting dates. A great solution to this is Moment.js (http://momentjs.com), a date parsing, validation, and formatting library. It has a fluent API, so calls can be chained like this:
moment().format("MMM Do YY"); // Sep 7th 15
This is convenient for turning scraped data into CSV files that work well with programs like Excel. Imagine you’ve got a web page with books that include a title and published date. You want to save the values to a database, but your database requires dates to be formatted as YYYY-MM-DD. Listing 3.5 shows how you can use moment with cheerio to do this.
Listing 3.5 Parsing dates and generating CSV
'use strict';

const cheerio = require('cheerio');
const fs = require('fs');
const html = fs.readFileSync('./input.html', 'utf8'); #1
const moment = require('moment'); #2
const $ = cheerio.load(html);

const books = $('.book')
  .map((i, el) => { #3
    return {
      title: $(el).find('h2').text(),
      author: $(el).find('h3').text(),
      published: $(el).find('h4').text()
    };
  })
  .get();

console.log('title, author, sourceDate, dbDate'); #4
books.forEach((book) => {
  let date = moment(new Date(book.published)); #5
  console.log(
    '%s, %s, %s, %s',
    book.title,
    book.author,
    book.published,
    date.format('YYYY-MM-DD')
  );
});
#1 Load the input file
#2 Require moment
#3 Map each book into title, author, and published date
#4 The headers for the CSV file
#5 Parse the date
Listing 3.5 requires that you install cheerio and moment. It takes HTML as input (from input.html) and outputs CSV. The HTML should have dates in h4 elements, like this:
<div>
  <div class="book">
    <h2>Catch-22</h2>
    <h3>Joseph Heller</h3>
    <h4>11 November 1961</h4>
  </div>
  <div class="book">
    <h2>A Handful of Dust</h2>
    <h3>Evelyn Waugh</h3>
    <h4>1934</h4>
  </div>
</div>
After it has loaded the input file, it loads moment, and then maps each book to a simple JavaScript object using cheerio’s .map and .get methods. The .map method iterates over each book, and the callback extracts each element we’re interested in using the .find traversal method. The .get method is then used to return the results as a plain JavaScript array.
Listing 3.5 outputs CSV using console.log. The header row is printed, and then each row is logged in a loop that iterates over the books. The dates are converted to a MySQL-compatible format with moment: first each date is parsed using new Date, and then it’s reformatted using moment.
Once you’ve got used to parsing and formatting dates, you can apply similar techniques to other data formats. For example, currency and distance measurements can be captured with regular expressions and then formatted using a more generic number-formatting library like numeral (https://www.npmjs.com/package/numeral).
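As a minimal sketch, assuming prices scraped as strings like "Price: $1,234.5" (a made-up example), a regular expression captures the number and numeral reformats it consistently:

const numeral = require('numeral');

const scraped = 'Price: $1,234.5'; // example scraped text, not from the chapter
const match = scraped.match(/\$([\d,.]+)/); // capture the digits after the dollar sign
if (match) {
  const value = Number(match[1].replace(/,/g, '')); // strip thousands separators
  console.log(numeral(value).format('$0,0.00')); // $1,234.50
}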

3.5 Summary
Web scraping draws on an eclectic range of skills, so this chapter covered concepts from client-side web development and server-side programming. Let's recap what you've learned in this chapter:
- Web scraping is the automated transformation of sometimes badly structured web pages into computer-friendly formats like CSV or databases.
- Web scraping is used for things like vertical search engines, but also for data journalism.
- If you're going to scrape a site, you should get permission first. You can do this by checking the site's robots.txt file and contacting the site's owner.
- The main tools are static HTML parsers (cheerio) and parsers capable of running JavaScript (JSDOM), but also browser developer tools for finding the right CSS selector for the elements you're interested in.
- Sometimes the data itself is not well formatted, so you may need to parse things like dates or currencies to make them work with databases.
Building on this idea of blending client-side and server-side skills, the next chapter is about full-stack web development with Node.
[2] jsdom 6.3.0 was the current version at the time of writing.