chapter five

5 Exploring data without persistence

 

This chapter covers

  • Converting CSV files to Parquet
  • Automatically inferring file type and data schema
  • Creating views to simplify the querying of nested JSON documents
  • Exploring the metadata of Parquet files
  • Querying other databases like SQLite

In this chapter, we’ll learn how to query data without persisting it in DuckDB. This technique is unusual for a database and may seem counterintuitive, but it is useful in the right situations. For example, when we need to transform data from one format to another, we may not want to create an intermediate storage model along the way.

This chapter also shows the power of DuckDB’s analytical engine in its own right, even when your data isn’t stored in DuckDB’s native format. We’ll query several common data formats, including JSON, CSV, and Parquet, as well as other databases such as SQLite.

The JSON and CSV sources that we work with in this chapter are located in the ch05 folder of our example repository on GitHub: github.com/duckdb-in-action/examples. We assume that you have navigated to the root of this repository before invoking the DuckDB CLI for the examples in this chapter.

5.1 Why use a database without persisting any data?

5.2 Inferring file type and schema

5.2.1 A note on CSV parsing

5.3 Shredding nested JSON

5.4 Translating CSV to Parquet

5.5 Analyzing and querying Parquet files

5.6 Querying SQLite and other databases

5.7 Summary