It’s hard to fathom just how powerful modern computers are. They perform billions of calculations per second, allowing us to have video chats with people around the world, predict the weather with incredible accuracy, and search through entire libraries of documents in the blink of an eye. An office worker from 100 years ago would be awestruck by how much data we can process and how little time it takes us to do so.
But let’s be honest: when were you last satisfied by your computer’s speed? If you’re like me, you spend very little time amazed by how fast your computer operates and a lot of time frustrated by how long it takes to do things.
I often say that Python is the perfect language for an age in which computers are cheap and people are expensive. By that, I mean Python optimizes for programmer productivity, often at the expense of efficient execution. Things aren’t all bad, though: the fact that pandas uses NumPy under the hood makes it far faster and slimmer than it would be with standard Python objects. The more we stay in the high-powered world of NumPy, and away from built-in Python objects, the faster and leaner our programs will be.
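To see this difference concretely, here is a minimal sketch comparing the memory footprint of the same one million integers stored as a NumPy-backed `int64` column versus boxed Python objects. The sizes in the comments are approximate and platform-dependent:

```python
import numpy as np
import pandas as pd

n = 1_000_000
fast = pd.Series(np.arange(n))   # NumPy-backed int64: 8 bytes per value
slow = fast.astype(object)       # each value boxed as a full Python int

print(fast.memory_usage(deep=True))  # roughly 8 MB of data
print(slow.memory_usage(deep=True))  # several times larger, due to per-object overhead
```

The `object` version pays for a pointer plus a full Python integer object per element, which is why staying in NumPy-backed dtypes keeps both memory use and execution time down.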
Beyond that general rule of thumb, though, there are numerous techniques for keeping your pandas data frames slim and your queries fast. Many of them come down to a simple rule: load only the data you actually need into a data frame. Because pandas keeps all of its data in memory, less data means less memory consumed and faster results.
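One common way to apply this rule is to select columns at load time rather than after the fact. Here is a minimal sketch using `read_csv` with its `usecols` and `dtype` parameters; the file contents and column names are invented for illustration:

```python
import io
import pandas as pd

# A hypothetical CSV with more columns than our analysis needs
csv_data = io.StringIO(
    "name,age,city,salary\n"
    "Alice,34,Boston,90000\n"
    "Bob,28,Chicago,75000\n"
)

# Load only the columns we'll actually query, with a slimmer integer dtype
df = pd.read_csv(
    csv_data,
    usecols=["name", "salary"],
    dtype={"salary": "int32"},
)

print(df.columns.tolist())  # ['name', 'salary']
```

Skipping the unneeded columns at read time means they never occupy memory at all, which is cheaper than loading everything and dropping columns afterward.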