6 Making your code faster and more efficient

 

This chapter covers

  • What scaling is
  • How to understand why your code is running slowly
  • How to make code faster with parallelization, including vectorization and multiprocessing
  • What caching is and how it can help you improve computational efficiency
  • Making use of Python’s data structures to optimize your code

Scalability is a codebase's ability to handle growing amounts of data or requests. For example, scaling a web application means handling increasing user traffic, perhaps going from hundreds of users to millions (or even billions).

Scaling matters in data science as well, where it usually refers to one of two main tasks:

  1. Optimizing code to run faster in order to handle a large number of operations in a shorter period of time
  2. Handling large datasets, including cleaning, feature engineering, and building models when your dataset may not fit in memory (or consumes a high amount of existing memory)

This chapter focuses on the first of these points; the second is covered in the next chapter. Optimizing code to run faster becomes increasingly important as data size grows, and this issue arises in many common situations.
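As a small preview of the techniques ahead, caching (covered in section 6.3) can be sketched with the standard library's `functools.lru_cache`. The naive recursive Fibonacci function below is an illustrative example, not code from this chapter; the decorator memoizes results so repeated subproblems are computed only once:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Naive recursion is exponential-time; the cache makes it linear
    # by storing each fib(n) the first time it is computed.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

# Without the cache, fib(200) would never finish; with it, it is instant.
print(fib(200))
```

The same idea, reusing work you have already done instead of repeating it, shows up throughout this chapter.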

6.1 Slow code walk-through

6.1.1 Don’t repeat yourself (DRY)

6.1.2 Line profiler

6.1.3 Reducing loops

6.2 Parallelization

6.2.1 Vectorization

6.2.2 Multiprocessing

6.2.3 Training machine learning models with parallelization

6.3 Caching

6.4 Data structures at scale

6.4.1 Sets

6.4.2 Priority queues

6.4.3 NumPy arrays

6.5 What’s next for computational efficiency?

6.6 Practice on your own

6.7 Summary