10 Performance considerations for large datasets

 

This chapter covers

  • Preparing large volumes of data to be imported into DuckDB
  • Querying metadata and running exploratory data analysis (EDA) queries on the large datasets
  • Exporting full databases concurrently to Parquet
  • Using aggregations on multiple columns to speed up statistical analysis
  • Using EXPLAIN and EXPLAIN ANALYZE to understand query plans

So far in this book, we’ve seen how to use DuckDB with a variety of datasets, but most of them have been small or medium in size. That isn’t unusual: such datasets are representative of many of the ones we come across in our daily work.

However, huge datasets do exist, and we wouldn’t want you to think that you need to reach for another data processing tool when you encounter them! In this chapter, we’re going to look at two datasets: the first contains data about Stack Overflow, the popular coding Q&A website, and the second contains data about taxi trips in New York City. With these two datasets we’ll share tips and tricks for working with bigger datasets in DuckDB.

For each one, we’ll show how to prepare and then import it into DuckDB. Once we’ve done that, we’ll run some queries on the data before exporting the database into a portable format.
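As a preview of that workflow, here is a minimal sketch of the three steps in DuckDB SQL. The file name posts.csv and the export directory are placeholders for illustration, not the chapter’s actual dataset files:

-- Import a CSV file into a table, letting DuckDB infer the schema
-- ('posts.csv' is a placeholder file name)
CREATE TABLE posts AS
SELECT * FROM read_csv_auto('posts.csv');

-- Inspect how DuckDB plans and executes a query before relying on it at scale
EXPLAIN ANALYZE
SELECT count(*) FROM posts;

-- Export the whole database to a directory of Parquet files
-- ('export_dir' is a placeholder directory name)
EXPORT DATABASE 'export_dir' (FORMAT PARQUET);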

10.1 Loading and querying the full Stack Overflow database

10.1.1 Data dump and extraction

10.1.2 The data model

10.1.3 Exploring the CSV file data

10.1.4 Loading the data into DuckDB

10.1.5 Fast exploratory queries on large tables

10.1.6 Posting on weekdays

10.1.7 Using enums for tags

10.2 Query planning and execution

10.2.1 Planner and optimizer

10.2.2 Runtime and vectorization

10.2.3 Visualizing query plans with EXPLAIN and EXPLAIN ANALYZE

10.3 Exporting the Stack Overflow data to Parquet

10.4 Exploring the New York Taxi dataset from Parquet files

10.4.1 Configuring credentials for S3 access

10.4.2 Auto-inferring file types

10.4.3 Exploring Parquet schema

10.4.4 Creating views

10.4.5 Analyzing the data