Chapter 15. Scalable scripting for large data sets: pipeline and database techniques
An online retailer needed to learn which of their Domain Name System (DNS) records were getting the most queries after an advertising campaign. They wrote a script to extract this information from their DNS server logs, but as the logs grew, the script slowed to a crawl. Worse, when they tried to run the script remotely against multiple servers, it failed with an OutOfMemoryException.
A web search for the terms PowerShell and OutOfMemoryException returns many thousands of hits. People are clearly struggling to manage large data sets.
The typical culprit is a fragile pattern that works in the lab but doesn't scale to real-world production data. In this chapter we'll explore how to write scripts that handle input of any size by processing records in "streams" instead of "water balloons."
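To make the distinction concrete, here is a minimal sketch of the two patterns in PowerShell. The log path dns.log and the 'query' filter are placeholders for illustration, not details from the retailer's actual script. The first version buffers every line in a variable (the balloon); the second streams each line through the pipeline, keeping roughly one record in memory at a time.

# "Water balloon" pattern: assigning Get-Content's output to a variable
# reads the entire log into memory before any filtering happens.
$lines = Get-Content .\dns.log
$queryLines = $lines | Where-Object { $_ -match 'query' }
$queryLines.Count

# "Stream" pattern: one pipeline; each line flows through the filter
# and the counter as it is read, so memory use stays flat.
Get-Content .\dns.log |
    Where-Object { $_ -match 'query' } |
    Measure-Object |
    Select-Object -ExpandProperty Count

Both versions produce the same count; the difference is that the streaming version's memory footprint does not grow with the size of the log.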
Imagine that you need to measure the amount of water that flows down a mountain stream per hour.
One approach would be to build a huge balloon to serve as a reservoir, stopping and holding all the water in one place, and then measure the volume of the balloon. This could work, but it would be slow and expensive, and it scales poorly. What if the requirement changes to measuring the water that flows over a year instead of an hour? What if the flow of water is higher than expected? Will the balloon pop?