5 Exploring data without persistence

This chapter covers

  • Converting CSV files to Parquet files
  • Auto-inferring file type and data schema
  • Creating views to simplify the querying of nested JSON documents
  • Exploring the metadata of Parquet files
  • Querying other databases, such as SQLite

In this chapter, we’ll learn how to query data without persisting it in DuckDB. This is an unusual way to use a database and may seem counterintuitive, but it is useful in the right situations. For example, if we need to transform data from one format to another, we may not want to create an intermediate storage model along the way.

This chapter also demonstrates the power of DuckDB’s analytical engine, even when your data isn’t stored in DuckDB’s native format. We’ll show how to query several common data formats, including JSON, CSV, and Parquet, as well as other databases, such as SQLite.

The JSON and CSV sources we are working with in this chapter are located in the ch05 folder of our example repository on GitHub: https://github.com/duckdb-in-action/examples. We assume you have navigated to the root of this repository before invoking the DuckDB CLI for the examples in this chapter.

5.1 Why use a database without persisting any data?

5.2 Inferring file type and schema

5.2.1 A note on CSV parsing

5.3 Shredding nested JSON

5.4 Translating CSV to Parquet

5.5 Analyzing and querying Parquet files

5.6 Querying SQLite and other databases

5.7 Working with Excel files

Summary