1 An introduction to DuckDB
This chapter covers
- Why DuckDB, a single-node in-memory database, emerged in the era of big data
- DuckDB’s capabilities
- How DuckDB works and fits into your data pipeline
We’re excited that you’ve picked up this book and are ready to learn about a technology that seems to go against the grain of everything we’ve learned about big data systems over the last decade. We’ve had a lot of fun using DuckDB, and we hope you will be as enthused as we are after reading this book. The book’s approach to teaching is hands-on, concise, and fast-paced, and it includes lots of code examples.
After reading the book, you should be able to use DuckDB to analyze tabular data in a variety of formats. You will also have a handy new tool in your toolbox for data transformation, cleanup, and conversion. You can integrate it into your Python notebooks and processes to replace pandas DataFrames in situations where they do not perform well. You will also be able to build quick applications for data analysis using Streamlit with DuckDB.
Let’s get started!
1.1 What is DuckDB?
DuckDB is a modern embedded analytics database that runs on your machine and lets you efficiently process and query gigabytes of data from different sources. It was created in 2018 by Mark Raasveldt and Hannes Mühleisen, who at the time were researchers in database systems at Centrum Wiskunde & Informatica (CWI), the national research institute for mathematics and computer science in the Netherlands.