10 Performance considerations for large datasets
This chapter covers
- Preparing large volumes of data to be imported into DuckDB
- Querying metadata and running exploratory data analysis (EDA) queries on large datasets
- Exporting full databases concurrently to Parquet
- Using aggregations on multiple columns to speed up statistical analysis
- Using `EXPLAIN` and `EXPLAIN ANALYZE` to understand query plans
So far in this book, we've seen how to use DuckDB with a variety of datasets, but most of them have been small or medium in size. That isn't unusual: those datasets are representative of many of the ones we come across in our daily work.
However, huge datasets do exist, and we wouldn't want you to think that you need a different data processing tool when you encounter them! In this chapter, we're going to look at two datasets: the first contains data about Stack Overflow, the popular coding Q&A website, and the second contains data about taxi trips in New York City. Using these two datasets, we'll share tips and tricks for working with bigger datasets in DuckDB.
For each dataset, we'll show how to prepare the data and then import it into DuckDB. Once the data is loaded, we'll run some queries on it before exporting the whole database into a portable format.
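As a quick preview of that last step, here's a minimal sketch of how a whole DuckDB database can be written out as Parquet files; the target directory name is just a placeholder, and we'll walk through the details later in the chapter.

```sql
-- Minimal sketch: write every table in the current database to a
-- directory of Parquet files. 'stackoverflow_export' is a placeholder path.
EXPORT DATABASE 'stackoverflow_export' (FORMAT PARQUET);
```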