Chapter 24. Extracting data with regular expressions
In most information management domains, such as resource management, sales, production, and accounting, natural language texts are rarely used as a source of information. In contrast, in domains such as content and document management, natural language texts are the principal or even the sole source of information.
Before this information can be used, it needs to be extracted from the source texts. This task can be performed manually by a person who reads the source, identifies individual pieces of data, and then copies and pastes them into a data entry application. Fortunately, this operation is highly deterministic, which means it can be automated.
In this chapter I’ll discuss how to use regular expressions to extract information from texts with SQL Server 2008 (most of this chapter also applies to SQL Server 2005) and the Microsoft .NET Framework implementation of regular expressions.
To get a better understanding of the problem at hand, let’s turn to Julie “Nitpick” Eagleeyes, an imaginary analyst and editor specializing in document management. Julie has to deal not only with hundreds of thousands of documents, but also with her managers, who are interested in lowering costs, and with the authors, who are interested in maintaining a steady and predictable modus operandi, performing their work in exactly the same way they have for years.