8 Schema management


This chapter covers

  • Managing schema changes in a cloud data platform
  • Understanding schema-on-read vs. an active schema-management approach
  • Evaluating when to use schema-as-a-contract vs. a smart-pipeline approach
  • Using Spark to infer schemas in batch mode
  • Implementing a Schema Registry as part of a Metadata layer
  • Using operational metadata to manage schema changes
  • Building resilient data pipelines to manage schema changes automatically
  • Managing schema changes with backward and forward compatibility
  • Managing schema changes through to the data warehouse consumption layer

In this chapter, we will tackle the age-old problem of managing the schema changes that are introduced into a data system when source data changes, exploring how the growing use of third-party data sources, such as SaaS applications, and the increasing adoption of streaming data add to the challenge.

We will discuss how our cloud data platform design can address these challenges, starting with the Schema Registry domain of the Metadata layer introduced in chapter 7. We will then look at different approaches to updating schemas in the Registry, from “do nothing and wait till something breaks” to schema-as-a-contract and smart pipelines.
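
As a quick preview of the batch schema-inference topic listed above, the following is a minimal sketch, not the chapter’s reference implementation, of asking Spark to infer a schema from a batch of incoming files and serializing the result so it could later be stored in a Schema Registry. The file path, format, and application name are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-inference-sketch").getOrCreate()

# Read a batch of incoming files and let Spark infer column names and types.
# The path and CSV format stand in for whatever the ingestion layer delivers.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/landing/sales/2024-01-01/"))

# The inferred schema can be serialized to JSON and registered in the
# Schema Registry domain of the Metadata layer for later comparison.
print(df.schema.json())

Later sections of this chapter discuss how such an inferred schema is compared with previously registered versions to detect and manage changes.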

8.1 Why schema management

8.1.1 Schema changes in a traditional data warehouse architecture

8.1.2 Schema-on-read approach

8.2 Schema-management approaches

8.2.1 Schema as a contract

8.2.2 Schema management in the data platform

8.2.3 Monitoring schema changes

8.3 Schema Registry implementation

8.3.1 Apache Avro schemas

8.3.2 Existing Schema Registry implementations