chapter eight
8 Schema Management
This chapter covers:
- How to manage schema changes, the bane of data warehousing, in a cloud data platform
By the end of this chapter you’ll be able to:
- Explain the differences between a “schema on read” approach and an active schema management approach
- Evaluate when to use a “schema-as-a-contract” approach vs. a “smart pipeline” approach
- Use Spark to infer schemas in batch mode
- Implement a schema registry as part of a metadata layer
- Use the operational metadata introduced in Chapter 7 to manage schema changes more easily
- Build resilient data pipelines that can manage schema changes automatically
- Manage common schema changes (adding and deleting a column, renaming an existing column, changing column types) as they relate to backward and forward compatibility
- Manage schema changes through to the data warehouse consumption layer
In this chapter we tackle the age-old problem of managing the schema changes that a data system must absorb when its source data changes. We explore how the growing use of third-party data sources, such as SaaS applications, and of streaming data adds to the challenge.
We will discuss how our cloud data platform design can be used to address these challenges, starting with the Schema Registry domain in the metadata layer introduced in Chapter 7. We then look at different approaches to updating schemas in the registry, from “do nothing and wait until something breaks” to “schema-as-a-contract” and “smart pipelines.”
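To make the common schema changes listed above concrete, here is a minimal sketch of how two versions of a schema might be compared. It is an illustration only, not the registry implementation developed later in the chapter: the dictionary representation of a schema (column name mapped to a type name), the function name diff_schemas, and the sample columns are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical representation: column name -> type name.
Schema = Dict[str, str]


def diff_schemas(old: Schema, new: Schema) -> Dict[str, List[str]]:
    """Classify the differences between two versions of a schema."""
    # Columns present only in the new version.
    added = sorted(c for c in new if c not in old)
    # Columns present only in the old version.
    removed = sorted(c for c in old if c not in new)
    # Columns present in both versions whose type name differs.
    retyped = sorted(c for c in old if c in new and old[c] != new[c])
    return {"added": added, "removed": removed, "retyped": retyped}


# Illustrative schema versions (made-up columns).
old = {"id": "long", "email": "string", "age": "int"}
new = {"id": "long", "email": "string", "age": "string", "country": "string"}

changes = diff_schemas(old, new)
print(changes)
```

Note that a name-based comparison like this one cannot recognize a renamed column: a rename shows up as one removed column plus one added column, which is part of what makes renames harder to handle than simple additions.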