7 Dealing with huge data files

This chapter covers

Using Node.js streams
Processing files incrementally to handle large data files
Working with massive CSV and JSON files

In this chapter, we’ll learn how to tackle large data files. How large? For this chapter, I downloaded a huge data set from the National Oceanic and Atmospheric Administration (NOAA) that contains measurements from weather stations around the world. The zipped download is around 2.7 GB and uncompresses to a whopping 28 GB. The original data set contains more than 1 billion records. We’ll work with only a portion of that data in this chapter, but even the cut-down example data doesn’t fit into the memory available to Node.js, so to handle data of this magnitude we’ll need new techniques.

In the future, we’d like to analyze this data, and we’ll come back to that in chapter 9. But as it stands, we can’t deal with this data using conventional techniques! To scale up our data-wrangling process and handle huge files, we need something more advanced. In this chapter, we’ll expand our toolkit to include incremental processing of CSV and JSON files using Node.js streams.
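To give a first feel for what incremental processing means, here is a minimal sketch that reads a file one small chunk at a time instead of loading it whole. The helper names `splitChunk` and `countLines`, the demo file name, and the 8-byte chunk size are assumptions made for illustration only; they are not the code developed later in the chapter. The sketch relies only on the built-in `fs.createReadStream` function.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: split a chunk into complete lines and carry over
// any partial line at the end, because chunk boundaries rarely align
// with line boundaries.
function splitChunk(leftover, chunk) {
    const parts = (leftover + chunk).split("\n");
    const remainder = parts.pop(); // The last piece may be an incomplete line.
    return { lines: parts, leftover: remainder };
}

// Hypothetical helper: count the non-empty lines in a file while holding
// only one chunk (plus a partial line) in memory at any time.
function countLines(filePath, chunkSize) {
    return new Promise((resolve, reject) => {
        let leftover = "";
        let count = 0;
        const stream = fs.createReadStream(filePath, {
            encoding: "utf8",
            highWaterMark: chunkSize, // Maximum bytes read per chunk.
        });
        stream.on("data", chunk => {
            const result = splitChunk(leftover, chunk);
            leftover = result.leftover;
            count += result.lines.filter(line => line.trim().length > 0).length;
        });
        stream.on("end", () => {
            // A final line without a trailing newline still counts.
            if (leftover.trim().length > 0) {
                count += 1;
            }
            resolve(count);
        });
        stream.on("error", reject);
    });
}

// Demo: a tiny stand-in file, read with a deliberately small chunk size
// so that lines are split across chunks.
const demoPath = path.join(os.tmpdir(), "stream-demo.csv");
fs.writeFileSync(demoPath, "Year,Temp\n1917,10.5\n1918,11.2\n");
countLines(demoPath, 8)
    .then(count => console.log("Lines:", count))
    .catch(err => console.error(err));
```

The point of the sketch is that memory use depends on the chunk size rather than the file size, which is the property we need for files that are too big to load in one go.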

7.1 Expanding our toolkit

7.2 Fixing temperature data

7.3 Getting the code and data

7.4 When conventional data processing breaks down

7.5 The limits of Node.js

7.5.1 Incremental data processing

7.5.2 Incremental core data representation

7.5.3 Node.js file streams basics primer

7.5.4 Transforming huge CSV files

7.5.5 Transforming huge JSON files

7.5.6 Mix and match

Summary