Chapter 10. Structured documents

 

This chapter covers

  • Using XML to read configuration files
  • Working with HTML
  • Generating XML with Hpricot
  • Reading RSS feeds

Almost any Ruby program you write will either load data from an external source or export data for later use, whether your own program reloads it or another program reads it. You might store the data in a dead-simple representation like YAML or a more complex one like Atom, but the basic principles remain the same.
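As a minimal sketch of that export-and-reload cycle, the following uses Ruby's standard yaml library. The settings hash, its keys, and the example.com URLs are hypothetical and chosen only for illustration.

```ruby
require 'yaml'

# Hypothetical settings a program might export for later reloading.
settings = {
  'name'    => 'feed-reader',
  'refresh' => 15,
  'feeds'   => ['http://example.com/a.rss', 'http://example.com/b.rss']
}

# Export: serialize the hash to YAML text (this could be written to a file).
yaml_text = YAML.dump(settings)

# Reload: parse the text back into plain Ruby objects.
reloaded = YAML.safe_load(yaml_text)

puts yaml_text
puts "round trip preserved the data: #{reloaded == settings}"
```

The sketch sticks to string keys and basic types (strings, integers, arrays) because YAML.safe_load only permits a small set of basic types by default.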

You can usually choose the data format for your own configuration files and external storage, but you’ll often need to use data produced by someone else: generated by programs written in other languages, or even written by hand. In these cases, you may need to read in, and correctly interpret, broken data files. The most common example is HTML from the internet, which frequently can’t be parsed without first repairing it.

Thankfully, almost every structured format you might come across has an associated Ruby library that makes it easy to read data into your program or write out information you’ve collected. Some of these libraries, like the Hpricot library that we’ll discuss later, also specialize in fixing broken input before giving you a simple API for parsing and manipulating the data.
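To give a rough sense of how little code such a library needs, here is a sketch that reads a small XML configuration document with REXML. The choice of REXML is an assumption made for this example (this introduction doesn't name an XML library), and the element and attribute names in the document are hypothetical.

```ruby
require 'rexml/document'

# Hypothetical configuration document, inlined so the example is self-contained.
xml = <<~XML
  <config>
    <database host="localhost" port="5432"/>
    <name>demo</name>
  </config>
XML

doc = REXML::Document.new(xml)

# elements[] takes an XPath expression and returns the first matching element.
db   = doc.root.elements['database']
host = db.attributes['host']        # attribute values come back as strings
port = db.attributes['port'].to_i   # convert explicitly when a number is needed
name = doc.root.elements['name'].text

puts "#{name}: #{host}:#{port}"
```

Note that this sketch assumes well-formed input. REXML raises a parse error on malformed XML rather than repairing it, which is the gap that a forgiving parser like Hpricot is meant to fill.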

10.1. XML in practice

10.2. Parsing HTML and XHTML with Hpricot

10.3. Writing configuration data: revisited

10.4. Reading RSS feeds

10.5. Creating your own feed

10.6. Using YAML for data storage

10.7. Summary