7 Dealing with huge data files
This chapter covers
- Using Node.js streams
- Processing files incrementally to handle large data files
- Working with massive CSV and JSON files
In this chapter, we’ll learn how to tackle large data files. How large? For this chapter, I downloaded a huge data set from the National Oceanic and Atmospheric Administration (NOAA). This data set contains measurements from weather stations around the world. The zipped download is around 2.7 GB, and it uncompresses to a whopping 28 GB of data. The original data set contains more than 1 billion records. We’ll work with only a portion of that data, but even the cut-down example data for this chapter doesn’t fit into the memory available to Node.js, so handling data of this magnitude requires new techniques.
In the future, we’d like to analyze this data, and we’ll come back to that in chapter 9. But as it stands, we can’t deal with this data using conventional techniques! To scale up our data-wrangling process and handle huge files, we need something more advanced. In this chapter, we’ll expand our toolkit to include incremental processing of CSV and JSON files using Node.js streams.
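To give you a feel for the incremental approach before we dig into the details, here’s a minimal sketch that uses a plain Node.js read stream (via the built-in readline module, not the stream-based CSV and JSON tools we’ll build later in this chapter) to count the records in a large file while keeping only one line in memory at a time. The file path is just a placeholder for this example.

```javascript
const fs = require('fs');
const readline = require('readline');

async function countRecords (filePath) {
    let count = 0;
    const lineReader = readline.createInterface({
        input: fs.createReadStream(filePath) // Stream the file instead of loading it all into memory.
    });
    for await (const line of lineReader) { // Receive the file one line at a time.
        count += 1; // We could process each record here, one at a time.
    }
    return count;
}

countRecords('./data/weather-stations.csv') // Placeholder path for a huge CSV file.
    .then(count => console.log(`Processed ${count} records.`))
    .catch(err => console.error(err));
```

Because the file is read in small chunks, this works just as well on a 28 GB file as on a 28 KB file; that’s the core idea behind everything we do in this chapter.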