chapter four

4 Working with unusual data

This chapter covers

Dealing with various unusual data formats
Parsing custom text file formats using regular expressions
Using web scraping to extract data from web pages
Working with binary data formats

In the previous chapter, you learned how to import various standard and common data formats into the core data representation and how to export them again. In this chapter, we’re going to look at several of the more unusual methods of importing data that you might need to use from time to time.

Continuing from chapter 3, let’s say that you’re maintaining a website about earthquakes and you need to accept new data from a variety of sources. We’ll explore several of the less common data formats you might need or want to support. Table 4.1 shows the new data formats we’ll cover.

Table 4.1 Data formats covered in chapter 4

Data Format	Data Source	Notes
Custom text	Text file	Data sometimes comes in custom or proprietary text formats.
HTML	Web server / REST API	Data can be scraped from HTML web pages when no other convenient access mechanism exists.
Custom binary	Binary file	Data sometimes comes in custom or proprietary binary formats. Or we may choose to use binary data as a more compact representation.
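
To make the "more compact representation" point concrete, here is a minimal sketch that packs a couple of records into a Node.js Buffer and unpacks them again. The record fields, the sample values, and the binary layout (a 4-byte count followed by five 64-bit floats per record) are all assumptions made for illustration. They are not a format defined in this chapter.

```javascript
// Hypothetical earthquake records; field names and values are illustrative only.
const records = [
    { time: 1514764800, latitude: -21.5, longitude: 169.9, depthKm: 10, magnitude: 5.1 },
    { time: 1514851200, latitude: 36.2, longitude: 140.1, depthKm: 35.5, magnitude: 4.7 },
];

// Assumed layout: a 4-byte record count, then five 64-bit floats per record,
// all little-endian.
const FIELDS = ["time", "latitude", "longitude", "depthKm", "magnitude"];
const HEADER_SIZE = 4;
const RECORD_SIZE = FIELDS.length * 8;

// Write the record count, then each field of each record at a fixed offset.
function packRecords(records) {
    const buffer = Buffer.alloc(HEADER_SIZE + records.length * RECORD_SIZE);
    buffer.writeUInt32LE(records.length, 0);
    records.forEach((record, recordIndex) => {
        const recordOffset = HEADER_SIZE + recordIndex * RECORD_SIZE;
        FIELDS.forEach((field, fieldIndex) => {
            buffer.writeDoubleLE(record[field], recordOffset + fieldIndex * 8);
        });
    });
    return buffer;
}

// Read the record count, then rebuild each record from the same fixed offsets.
function unpackRecords(buffer) {
    const count = buffer.readUInt32LE(0);
    const unpacked = [];
    for (let recordIndex = 0; recordIndex < count; ++recordIndex) {
        const recordOffset = HEADER_SIZE + recordIndex * RECORD_SIZE;
        const record = {};
        FIELDS.forEach((field, fieldIndex) => {
            record[field] = buffer.readDoubleLE(recordOffset + fieldIndex * 8);
        });
        unpacked.push(record);
    }
    return unpacked;
}

const packed = packRecords(records);
console.log("Binary size in bytes:", packed.length);
console.log("JSON size in bytes:", Buffer.byteLength(JSON.stringify(records)));
```

Because every field sits at a fixed offset, the binary form needs no field names or punctuation, which is where the size saving over JSON comes from. The trade-off is that the reader and the writer must agree on the layout in advance.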

In this chapter, we’ll add new tools to our toolkit for working with regular expressions, scraping web pages, and decoding binary files. These tools are listed in Table 4.2.
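
As a small taste of the first of these tools, the sketch below uses a regular expression to pull fields out of a custom text format. The pipe-delimited format, the sample lines, and the function name `parseCustomText` are hypothetical and exist only for illustration. A real proprietary format would need its own pattern.

```javascript
// Hypothetical custom format: one record per line, "time|magnitude|place".
const sampleText = [
    "2018-01-01T01:22:11Z|5.1|Vanuatu",
    "2018-01-02T14:03:45Z|4.7|near the east coast of Honshu, Japan",
].join("\n");

// One capture group per field. The "m" flag makes ^ and $ match at line
// boundaries and the "g" flag lets matchAll find every line.
const linePattern = /^([^|\n]+)\|(\d+(?:\.\d+)?)\|(.+)$/gm;

// Convert each matching line into a JavaScript object (one row of data).
function parseCustomText(text) {
    const rows = [];
    for (const match of text.matchAll(linePattern)) {
        rows.push({
            time: new Date(match[1]),
            magnitude: parseFloat(match[2]),
            place: match[3],
        });
    }
    return rows;
}

console.log(parseCustomText(sampleText));
```

Lines that don't match the pattern are silently skipped in this sketch. For production use you would probably want to detect and report them instead.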

4.1 Getting the code and data

4.2 Importing custom data from text files

4.3 Importing data by scraping web pages

4.3.1 Identifying the data to scrape

4.3.2 Scraping with Cheerio

4.4 Working with binary data

4.4.1 Unpacking a custom binary file

4.4.2 Packing a custom binary file

4.4.3 Replacing JSON with BSON

4.4.4 Converting JSON to BSON

4.4.5 Deserializing a BSON file

Summary