8 Working with a mountain of data

 

This chapter covers

  • Using a database for a more efficient data-wrangling process
  • Getting a huge data file into MongoDB
  • Working effectively with a large database
  • Optimizing your code for improved data throughput

This chapter addresses the question: How can we be more efficient and effective when we’re working with a massive data set?

In chapter 7, we worked with several extremely large data files originally downloaded from the National Oceanic and Atmospheric Administration, and we saw that it's possible to work with CSV and JSON files of that size. But files of this magnitude are too unwieldy for productive data analysis. To be productive, we must move our large data set into a database.

In this chapter, we move our data into a MongoDB database, which is itself a big operation given the size of the data. Once the data is in the database, we can work with it far more effectively, using queries and the other features of the database API.
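
As a taste of what the database buys us, here is a minimal sketch of asking for exactly the records we need instead of scanning an entire multi-gigabyte file. It assumes the official mongodb driver for Node.js (version 4 or later) and hypothetical database, collection, and field names; section 8.7 builds this workflow up step by step.

const { MongoClient } = require('mongodb');

async function main() {
    const client = new MongoClient('mongodb://localhost:27017');
    await client.connect();
    const records = await client
        .db('weather_stations')        // hypothetical database name
        .collection('daily_readings')  // hypothetical collection name
        .find({ Year: 2016 })          // the database does the filtering for us
        .toArray();
    console.log(`Retrieved ${records.length} records.`);
    await client.close();
}

main().catch(err => console.error(err));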

I selected MongoDB for this chapter, and for the book generally, because it's my preferred database. That's a personal (and, I believe, also a practical) choice, but really any database will do, and I encourage you to try the techniques in this chapter on your own database of choice. Many of them will carry over to other databases, although you'll have to translate the code to work with your chosen technology.

8.1 Expanding our toolkit

8.2 Dealing with a mountain of data

8.3 Getting the code and data

8.4 Techniques for working with big data

8.4.1 Start small

8.4.2 Go back to small

8.4.3 Use a more efficient representation

8.4.4 Prepare your data offline

8.5 More Node.js limitations

8.6 Divide and conquer

8.7 Working with large databases

8.7.1 Database setup

8.7.2 Opening a connection to the database
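
A minimal sketch of what opening a connection can look like with the modern, promise-based mongodb driver, assuming a MongoDB server on the default local port; the database and collection names are hypothetical. Open one client, reuse it for all your operations (the driver pools connections internally), and close it only when you're completely done:

const { MongoClient } = require('mongodb');

async function openDatabase() {
    const client = new MongoClient('mongodb://localhost:27017');
    await client.connect(); // fails here if the server isn't reachable
    const db = client.db('weather_stations'); // hypothetical database name
    return { client, db };
}

openDatabase()
    .then(async ({ client, db }) => {
        const collection = db.collection('daily_readings'); // hypothetical name
        console.log(await collection.countDocuments());
        await client.close();
    })
    .catch(err => console.error(err));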

8.7.3 Moving large files to your database
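
One way to approach this, sketched below, is to stream the file line by line and insert documents in batches with insertMany, so that neither the whole file nor the whole collection ever sits in memory at once. The naive comma split stands in for a real CSV parser, and the field handling is illustrative only; the collection parameter is a handle obtained as in the connection sketch above:

const fs = require('fs');
const readline = require('readline');

async function importCsvFile(filePath, collection, batchSize = 1000) {
    const lines = readline.createInterface({ input: fs.createReadStream(filePath) });
    let headers = null;
    let batch = [];
    for await (const line of lines) {
        const fields = line.split(','); // naive; use a real CSV parser in practice
        if (!headers) {
            headers = fields; // the first line holds the column names
            continue;
        }
        const record = {};
        headers.forEach((name, i) => { record[name] = fields[i]; });
        batch.push(record);
        if (batch.length >= batchSize) {
            await collection.insertMany(batch); // one round trip per batch
            batch = [];
        }
    }
    if (batch.length > 0) {
        await collection.insertMany(batch); // flush the final partial batch
    }
}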

8.7.4 Incremental processing with a database cursor
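
A cursor lets us visit the collection one document at a time while the driver fetches results in modest batches behind the scenes, so memory use stays flat no matter how big the collection is. A minimal sketch, again assuming a collection handle obtained as in the connection sketch:

async function processWholeCollection(collection) {
    const cursor = collection.find({}); // no filter: visit every document
    for await (const record of cursor) {
        // Process one record at a time; only the driver's current
        // batch of documents is ever held in memory.
    }
}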

8.7.5 Incremental processing with data windows
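
Alternatively, we can pull the data down in fixed-size windows using skip and limit and process each window as a batch. A minimal sketch; note that skip gets slower the deeper it reaches into a large collection, so range-based paging on an indexed field is often preferable at very large scale:

async function processByWindow(collection, windowSize = 100) {
    for (let windowIndex = 0; ; windowIndex += 1) {
        const window = await collection.find({})
            .skip(windowIndex * windowSize) // jump to the start of this window
            .limit(windowSize)              // pull down one window's worth
            .toArray();
        if (window.length === 0) {
            break; // we've run off the end of the collection
        }
        // ... process the documents in this window as a batch ...
    }
}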

8.7.6 Creating an index
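
Creating an index is a one-liner; without one, every query and sort must scan the entire collection. A sketch, with a hypothetical field name:

async function indexByYear(collection) {
    // Ascending index on the (hypothetical) Year field; queries and
    // sorts on Year can now walk the index instead of scanning everything.
    await collection.createIndex({ Year: 1 });
}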

8.7.7 Filtering using queries
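
With a query, we describe the records we want and let the database, rather than our JavaScript code, do the filtering, so only matching documents ever cross the wire. A sketch, with hypothetical field names:

async function recordsForYear(collection, year) {
    return collection
        .find({ Year: year }) // only matching documents are returned
        .toArray();
}

// A compound filter works the same way:
async function hotDaysIn(collection, year) {
    return collection
        .find({ Year: year, MaxTemp: { $gt: 30 } }) // $gt: greater than
        .toArray();
}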

8.7.8 Discarding data with projection
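
Projection discards the fields we don't need before the data leaves the database, which can shrink each document dramatically. A sketch with hypothetical field names; the _id field is included by default, so we exclude it explicitly:

async function yearlyPrecipitation(collection) {
    return collection
        .find({}, {
            projection: { Year: 1, Precipitation: 1, _id: 0 }, // keep only these
        })
        .toArray();
}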

8.7.9 Sorting large data sets
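
Sorting is also best delegated to the database. With the index from section 8.7.6 in place, the sort can walk the index rather than loading and ordering the whole collection; without a suitable index, the server may refuse to sort a very large result set entirely in memory. Combining sort with a cursor keeps our own memory use flat:

async function visitInYearOrder(collection) {
    const cursor = collection
        .find({})
        .sort({ Year: 1 }); // ascending; use -1 for descending
    for await (const record of cursor) {
        // Records arrive already ordered by the (hypothetical) Year field.
    }
}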

8.8 Achieving better data throughput

8.8.1 Optimize your code
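
Before changing anything, measure: Node's built-in timers are enough to find the hot spot, and optimizing unmeasured code is guesswork. A generic sketch with hypothetical field names:

function totalPrecipitation(records) {
    console.time('total-precipitation'); // start a named timer
    let total = 0;
    for (const record of records) {
        total += Number(record.Precipitation) || 0;
    }
    console.timeEnd('total-precipitation'); // prints the elapsed time
    return total;
}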

8.8.2 Optimize your algorithm
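
Bigger wins usually come from a better algorithm than from line-level tweaks; replacing a repeated linear search with a prebuilt lookup table is the standard example. A generic sketch, with hypothetical field names, joining readings to their stations:

// Slow: for each reading, scan the whole station list (O(n * m)).
function joinSlow(readings, stations) {
    return readings.map(reading => ({
        ...reading,
        station: stations.find(s => s.id === reading.stationId),
    }));
}

// Faster: build a Map once, then join with O(1) lookups (O(n + m)).
function joinFast(readings, stations) {
    const byId = new Map(stations.map(s => [s.id, s]));
    return readings.map(reading => ({
        ...reading,
        station: byId.get(reading.stationId),
    }));
}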

8.8.3 Process your data in parallel
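
Node.js runs our JavaScript on a single thread, but we can still overlap work: database and network operations proceed concurrently if we issue several at once. A sketch that processes several data windows at a time, bounded so we don't flood the database; it builds on the windowing sketch from section 8.7.5. For CPU-heavy per-record work, this isn't enough, and you'd reach for worker threads or multiple Node.js processes instead, since this approach only overlaps I/O:

async function processWindowsConcurrently(collection, windowSize = 1000, concurrency = 4) {
    const total = await collection.countDocuments();
    const numWindows = Math.ceil(total / windowSize);
    for (let start = 0; start < numWindows; start += concurrency) {
        const end = Math.min(start + concurrency, numWindows);
        const inFlight = [];
        for (let w = start; w < end; w += 1) {
            inFlight.push(
                collection.find({})
                    .skip(w * windowSize)
                    .limit(windowSize)
                    .toArray()
                    .then(windowOfRecords => {
                        // ... process this window of records ...
                    })
            );
        }
        await Promise.all(inFlight); // finish this group before starting the next
    }
}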

Summary