8 Working with a mountain of data
This chapter covers
- Using a database for a more efficient data-wrangling process
- Getting a huge data file into MongoDB
- Working effectively with a large database
- Optimizing your code for improved data throughput
This chapter addresses the question: How can we be more efficient and effective when we’re working with a massive data set?
In the last chapter, we worked with several extremely large files that were originally downloaded from the National Oceanic and Atmospheric Administration (NOAA). Chapter 7 showed that it’s possible to work with CSV and JSON files that are this large! However, files of this magnitude are simply too unwieldy for productive data analysis. To be effective from here on, we must move our large data set into a database.
In this chapter, we move our data into a MongoDB database, and this is a big operation considering the size of the data. With our data in the database, we can work more effectively with the help of queries and other features of the database API.
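To give a flavor of what that looks like, here is a minimal sketch of connecting to MongoDB from Node.js and running a query through the official mongodb driver. The connection string, the database name weather_stations, the collection name daily_readings, and the query field Year are placeholders for illustration only; the point is that the query is evaluated inside the database, so only the matching records come back into our program.

const { MongoClient } = require('mongodb');

async function queryDatabase () {
    // Connect to a local MongoDB instance (placeholder connection string).
    const client = new MongoClient('mongodb://localhost:27017');
    await client.connect();

    try {
        // Placeholder database and collection names for illustration.
        const db = client.db('weather_stations');
        const collection = db.collection('daily_readings');

        // The filter runs in the database itself, so only matching
        // records are transferred back to our program.
        const records = await collection.find({ Year: 2016 })
            .limit(10)
            .toArray();

        console.log(records);
    }
    finally {
        await client.close(); // Always release the connection.
    }
}

queryDatabase()
    .catch(err => console.error(err));

Compare this with chapter 7, where we had to stream or load an entire file just to find the records we cared about; here the database does that filtering work for us.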
I selected MongoDB for this chapter, and for the book generally, because it’s my preferred database. That’s a personal (and I believe also a practical) choice, but really any database will do, and I encourage you to try the techniques in this chapter with your own database of choice. Many of the techniques presented here will carry over to other databases, although you’ll have to figure out how to translate the code to your particular technology.