Chapter 15. Scalable scripting for large data sets: pipeline and database techniques
An online retailer needed to learn which of their Domain Name System (DNS) records were getting the most queries after an advertising campaign. They wrote a script to extract this information from their DNS server logs, but as the logs grew, the script slowed to a crawl. Worse, when they tried to run the script remotely against multiple servers, it failed with an OutOfMemoryException.
A web search for the terms PowerShell and OutOfMemoryException returns many thousands of hits. People are clearly struggling to manage large data sets.
The typical culprit is a fragile pattern that works in the lab but doesn't scale to real-world production data. In this chapter we'll explore how to write scripts that handle input of any size by processing records in "streams" instead of "water balloons."
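To make the distinction concrete, here is a minimal sketch of the two patterns in PowerShell. The log path dns.log and the 'query' filter are placeholders for illustration, not details from the retailer's actual script. The first version buffers every line in a variable (the balloon); the second streams each line through the pipeline, keeping roughly one record in memory at a time.

# "Water balloon" pattern: assigning Get-Content's output to a variable
# reads the entire log into memory before any filtering happens.
$lines = Get-Content .\dns.log
$queryLines = $lines | Where-Object { $_ -match 'query' }
$queryLines.Count

# "Stream" pattern: one pipeline; each line flows through the filter
# and the counter as it is read, so memory use stays flat.
Get-Content .\dns.log |
    Where-Object { $_ -match 'query' } |
    Measure-Object |
    Select-Object -ExpandProperty Count

Both versions produce the same count; the difference is that the streaming version's memory footprint does not grow with the size of the log.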
Imagine that you need to measure the amount of water that flows down a mountain stream per hour.
One approach would be to build a huge balloon to serve as a reservoir, stopping and holding all the water in one place, and then measure the volume of the balloon. This could work, but it would be slow and expensive, and it scales poorly. What if the requirement changes to measuring the water that flows over a year instead of an hour? What if the flow of water is higher than expected? Will the balloon pop?