4 Working with unusual data

 

This chapter covers

  • Dealing with various unusual data formats
  • Parsing custom text file formats using regular expressions
  • Using web scraping to extract data from web pages
  • Working with binary data formats

In the previous chapter, you learned how to import common, standard data formats into the core data representation and how to export them again. In this chapter, we’re going to look at several of the more unusual methods of importing data that you might need from time to time.

Continuing from chapter 3, let’s say that you’re maintaining a website about earthquakes and you need to accept new data from a variety of sources. In this chapter, we’ll explore several of the not-so-regular data formats you might need or want to support. Table 4.1 shows the new data formats we’ll cover.

Table 4.1 Data formats covered in chapter 4

Data format   | Data source           | Notes
Custom text   | Text file             | Data sometimes comes in custom or proprietary text formats.
HTML          | Web server / REST API | Data can be scraped from HTML web pages when no other convenient access mechanism exists.
Custom binary | Binary file           | Data sometimes comes in custom or proprietary binary formats. Or we may choose to use binary data as a more compact representation.

In this chapter, we’ll add new tools to our toolkit for working with regular expressions, scraping web pages, and decoding binary files. These tools are listed in Table 4.2.

4.1 Getting the code and data

4.2 Importing custom data from text files
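
As a rough illustration of parsing a custom text format with a regular expression, here is a minimal sketch. The line format, the field names, and the parseCustomText function are all hypothetical and invented for this example; a real custom format would need its own pattern.

```javascript
// Hypothetical custom format (an assumption for illustration):
// one earthquake per line, for example
//   Time=2017-03-01T12:30:45Z Lat=-38.25 Lng=175.80 Mag=4.6
const linePattern =
    /^Time=(\S+)\s+Lat=(-?\d+(?:\.\d+)?)\s+Lng=(-?\d+(?:\.\d+)?)\s+Mag=(\d+(?:\.\d+)?)$/;

// Convert the text into an array of JavaScript objects.
function parseCustomText(text) {
    return text
        .split(/\r?\n/)                      // Split the text into lines.
        .map(line => line.trim())
        .filter(line => line.length > 0)     // Skip blank lines.
        .map(line => {
            const match = linePattern.exec(line);
            if (!match) {
                throw new Error("Unrecognized line: " + line);
            }
            // Each capture group maps to one field of the record.
            return {
                Time: new Date(match[1]),
                Latitude: parseFloat(match[2]),
                Longitude: parseFloat(match[3]),
                Magnitude: parseFloat(match[4]),
            };
        });
}

const sampleText = [
    "Time=2017-03-01T12:30:45Z Lat=-38.25 Lng=175.80 Mag=4.6",
    "Time=2017-03-02T08:05:10Z Lat=36.10 Lng=140.10 Mag=5.2",
].join("\n");

const records = parseCustomText(sampleText);
console.log(records);
```

Throwing on an unrecognized line is one possible design choice; skipping or logging bad lines may suit messy real-world files better.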

4.3 Importing data by scraping web pages

4.3.1 Identifying the data to scrape

4.3.2 Scraping with Cheerio

4.4 Working with binary data

4.4.1 Unpacking a custom binary file
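
To give a feel for unpacking binary data, here is a minimal sketch using Node.js Buffer methods. The binary layout (a 4-byte record count followed by four 8-byte doubles per record), the field names, and the unpackRecords function are assumptions made for this example, not a real file format.

```javascript
// Assumed (hypothetical) layout, little endian throughout:
//   bytes 0-3   unsigned 32-bit record count
//   then, per record, four 64-bit doubles:
//   time (ms since epoch), latitude, longitude, magnitude
const HEADER_SIZE = 4;
const RECORD_SIZE = 8 * 4;

// Decode a Buffer in the assumed layout into JavaScript objects.
function unpackRecords(buffer) {
    const numRecords = buffer.readUInt32LE(0);
    const records = [];
    for (let i = 0; i < numRecords; ++i) {
        const offset = HEADER_SIZE + i * RECORD_SIZE;
        records.push({
            Time: new Date(buffer.readDoubleLE(offset)),
            Latitude: buffer.readDoubleLE(offset + 8),
            Longitude: buffer.readDoubleLE(offset + 16),
            Magnitude: buffer.readDoubleLE(offset + 24),
        });
    }
    return records;
}

// Build a one-record buffer by hand so the sketch is self-contained.
const demoBuffer = Buffer.alloc(HEADER_SIZE + RECORD_SIZE);
demoBuffer.writeUInt32LE(1, 0);
demoBuffer.writeDoubleLE(Date.UTC(2017, 2, 1), 4); // 1 March 2017, UTC
demoBuffer.writeDoubleLE(-38.25, 12);
demoBuffer.writeDoubleLE(175.8, 20);
demoBuffer.writeDoubleLE(4.6, 28);

const unpacked = unpackRecords(demoBuffer);
console.log(unpacked);
```

In practice the Buffer would more likely come from a file, for example via fs.readFileSync, rather than being built by hand.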

4.4.2 Packing a custom binary file
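
Packing is the reverse operation: writing each field into a Buffer at a known offset. The sketch below assumes a hypothetical layout of a 4-byte record count followed by four 8-byte doubles per record; the layout, the field names, and the packRecords function are invented for illustration.

```javascript
// Assumed (hypothetical) layout, little endian throughout:
//   bytes 0-3   unsigned 32-bit record count
//   then, per record, four 64-bit doubles:
//   time (ms since epoch), latitude, longitude, magnitude
const HEADER_SIZE = 4;
const RECORD_SIZE = 8 * 4;

// Encode an array of records into a Buffer in the assumed layout.
function packRecords(records) {
    const buffer = Buffer.alloc(HEADER_SIZE + records.length * RECORD_SIZE);
    buffer.writeUInt32LE(records.length, 0);
    records.forEach((record, i) => {
        const offset = HEADER_SIZE + i * RECORD_SIZE;
        buffer.writeDoubleLE(record.Time.getTime(), offset);
        buffer.writeDoubleLE(record.Latitude, offset + 8);
        buffer.writeDoubleLE(record.Longitude, offset + 16);
        buffer.writeDoubleLE(record.Magnitude, offset + 24);
    });
    return buffer;
}

const packed = packRecords([
    {
        Time: new Date("2017-03-01T00:00:00Z"),
        Latitude: -38.25,
        Longitude: 175.8,
        Magnitude: 4.6,
    },
]);
console.log(packed.length); // 4-byte header + one 32-byte record
```

The resulting Buffer could then be saved with fs.writeFileSync. Fixed-size records keep the offsets simple, at the cost of not supporting variable-length fields such as strings.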

4.4.3 Replacing JSON with BSON

4.4.4 Converting JSON to BSON

4.4.5 Deserializing a BSON file

Summary