4 Working with unusual data
This chapter covers
- Dealing with various unusual data formats
- Parsing custom text file formats using regular expressions
- Using web scraping to extract data from web pages
- Working with binary data formats
In the previous chapter, you learned how to import various standard and common data formats into the core data representation and how to export them again. In this chapter, we’re going to look at several more unusual methods of importing data that you might need from time to time.
Continuing from chapter 3, let’s say that you’re maintaining a website about earthquakes and you need to accept new data from a variety of sources. We’ll explore several of the less common data formats you might need or want to support. Table 4.1 shows the new data formats we’ll cover.
Table 4.1 Data formats covered in chapter 4
Data Format | Data Source | Notes |
Custom text | Text file | Data sometimes comes in custom or proprietary text formats. |
HTML | Web server / REST API | Data can be scraped from HTML web pages when no other convenient access mechanism exists. |
Custom binary | Binary file | Data sometimes comes in custom or proprietary binary formats, or we may choose binary data ourselves as a more compact representation. |
In this chapter, we’ll add new tools to our toolkit for working with regular expressions, scraping web pages, and decoding binary files. These tools are listed in Table 4.2.