Chapter 4. Handling large data on a single computer


This chapter covers

  • Working with large data sets on a single computer
  • Working with Python libraries suitable for larger data sets
  • Understanding the importance of choosing correct algorithms and data structures
  • Understanding how you can adapt algorithms to work inside databases

What if you had so much data that it seemed to outgrow you, and your techniques no longer sufficed? What would you do: surrender or adapt?

Luckily, you chose to adapt, because you’re still reading. This chapter introduces techniques and tools for handling larger data sets that a single computer can still manage, provided you take the right approach.

Chapter 3 focused on in-memory data sets; this chapter gives you the tools to perform classifications and regressions when the data no longer fits into your computer’s RAM (random access memory). Chapter 5 will go a step further and teach you how to deal with data sets that require multiple computers to process. When we refer to large data in this chapter, we mean data that is difficult to work with because of memory or speed constraints but can still be handled by a single computer.
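
As a rough illustration of what working with data that does not fit into RAM can look like, the sketch below computes the mean of one column by reading a CSV file one row at a time instead of loading it whole. The function name streaming_mean and the CSV layout are assumptions made for this example only; they are not taken from the case studies later in the chapter.

```python
import csv


def streaming_mean(path, column):
    """Return the mean of a numeric CSV column without loading the whole file.

    Hypothetical helper for illustration: `path` points to a CSV file with a
    header row, and `column` names a column that holds numeric values.
    """
    total = 0.0
    count = 0
    with open(path, newline="") as handle:
        # csv.DictReader yields one row at a time, so memory use stays
        # roughly constant no matter how many rows the file holds.
        for row in csv.DictReader(handle):
            total += float(row[column])
            count += 1
    if count == 0:
        raise ValueError("no rows found")
    return total / count
```

Only a running total and a row count are kept in memory, so the same code works whether the file holds a thousand rows or a billion; the trade-off is that each pass over the data costs a full read from disk.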

4.1. The problems you face when handling large data

4.2. General techniques for handling large volumes of data

4.3. General programming tips for dealing with large data sets

4.4. Case study 1: Predicting malicious URLs

4.5. Case study 2: Building a recommender system inside a database

4.6. Summary