3 Best practices for data staging

 

This chapter covers

  • Creating external stages to ingest data files from cloud storage
  • Viewing stage metadata with directory tables
  • Preparing data files for efficient ingestion
  • Querying data in external stages with external tables
  • Using materialized views to improve query performance

In this chapter, we will build upon the first data pipeline we created in chapter 2. We will add more functionality in the data ingestion step by accessing files from cloud storage rather than the local file system. We will authenticate to the cloud storage provider and create external stages.

Because requirements for data pipelines vary, there is no one-size-fits-all approach to building data pipelines. We will demonstrate and compare different options for ingesting data from cloud storage, including using the COPY command on the external stage, creating external tables, and creating materialized views to improve query performance. We will also review how to prepare data files in cloud storage for efficient ingestion.

Our first step will be to create an external stage exposing files from cloud storage to Snowflake. In this chapter, we will continue to ingest data from CSV files. In addition to CSV, Snowflake can ingest many structured and semistructured file formats, such as JSON, Parquet, or XML, which we will discuss in the next chapter.

3.1 Creating external stages

3.1.1 Configuring a storage integration

3.1.2 Creating an external stage using a storage integration

3.1.3 Creating an external stage using credentials

3.1.4 Loading data from staged files into a staging table

3.1.5 Avoiding duplication when loading data from staged files

3.1.6 Using a named file format

3.2 Viewing stage metadata with directory tables

3.3 Preparing data files for efficient ingestion

3.3.1 File sizing recommendations

3.3.2 Organizing data by path

3.4 Building pipelines with external tables

3.4.1 Querying data in external stages with external tables

3.4.2 Using materialized views to improve query performance

Summary