concept Schema Registry in category cloud

appears as: Schema Registry
Designing Cloud Data Platforms MEAP V06

This is an excerpt from Manning's book Designing Cloud Data Platforms MEAP V06.

We will group different metadata items into four main domains: Pipeline metadata (Configuration), Pipeline Activities (Activity), Data Quality Checks (Configuration) and Schema Registry (Configuration). Figure 7.4 shows these domains and the relationship among them:

Figure 7.4. Main metadata domains and their inter-relationships

Pipeline Metadata contains information about all existing data sources and data destinations, as well as the ingestion and transformation pipelines configured in the platform. Data sources and data destinations store their schemas in the Schema Registry, which we will cover in detail in chapter 8. Additionally, both ingestion and transformation pipelines can apply different data quality checks, and information about these checks is stored in the Data Quality Checks domain. Finally, each pipeline’s execution is tracked in the Pipeline Activity domain, including success or failure status, duration, and various other statistics such as the amount of data read and written. If you think about data platform pipelines as applications, then Pipeline Metadata is your application configuration and Pipeline Activities are your application log files and metrics.
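To make the four domains concrete, the sketch below shows what one record from each domain might look like. All field names and values here are illustrative assumptions, not the book's concrete data model:

```python
# Illustrative sketch only: field names and values are assumptions,
# not a prescribed data model for the four metadata domains.

pipeline_metadata = {          # Configuration: sources, destinations, pipelines
    "pipeline_id": "ingest_orders",
    "source": "orders_db",
    "destination": "landing/orders",
    "schedule": "hourly",
}

pipeline_activity = {          # Activity: one record per pipeline execution
    "pipeline_id": "ingest_orders",
    "run_id": "run-0001",
    "status": "SUCCESS",
    "duration_seconds": 312,
    "rows_read": 145_000,
    "rows_written": 145_000,
}

data_quality_check = {         # Configuration: a check a pipeline can apply
    "check_id": "orders_not_null",
    "pipeline_id": "ingest_orders",
    "rule": "order_id IS NOT NULL",
}

schema_registry_entry = {      # Configuration: one schema version for a source
    "source": "orders_db",
    "version": 3,
    "columns": [{"name": "order_id", "type": "string"}],
}
```

Note how the three Configuration domains cross-reference each other by identifiers (here, `pipeline_id` and `source`), while the Activity domain accumulates one record per run, like an application log.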

Schema Registry is a repository for schemas. It contains all versions of all schemas for all data sources. Data transformation pipelines or people who need to know the schema for a particular data source can fetch the latest version from the Registry. They can also explore all previous versions of the schema to understand how a particular data source has evolved over time.
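The core behavior described above — register new versions, fetch the latest, and walk the version history — can be sketched with a minimal in-memory registry. The class and method names are assumptions for illustration, not the book's design:

```python
# Minimal in-memory sketch of a versioned schema registry.
# Class and method names are illustrative assumptions.

class SchemaRegistry:
    def __init__(self):
        # source name -> list of schema versions, oldest first
        self._schemas = {}

    def register(self, source, schema):
        """Append a new schema version; returns the 1-based version number."""
        versions = self._schemas.setdefault(source, [])
        versions.append(schema)
        return len(versions)

    def latest(self, source):
        """Fetch the most recent schema for a data source."""
        return self._schemas[source][-1]

    def version(self, source, n):
        """Fetch a specific historical version (1-based)."""
        return self._schemas[source][n - 1]

    def history(self, source):
        """All versions, oldest first, to see how the source evolved."""
        return list(self._schemas[source])


registry = SchemaRegistry()
registry.register("orders_db", {"columns": ["order_id", "amount"]})
v = registry.register("orders_db", {"columns": ["order_id", "amount", "currency"]})
```

A transformation pipeline would call `latest("orders_db")` before each run, while a data engineer investigating a breakage could diff `version(..., 1)` against `version(..., 2)`.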

Since the Schema Registry is a logical part of the Metadata layer, it makes sense to implement it using the same approach we described in Chapter 7. We have already mentioned that the Schema Registry is really just a database for schemas and their versions. Rather than repeat all three implementation options described above, we will focus on the last one: a database with an API layer on top of it. You can still use just a database without an API layer if the number of tools and teams interacting with the Schema Registry is low. Using text files and a code repository to store schemas, similar to the simplest option for the pipeline configuration, will not work here, because in our design schemas are updated automatically by the pipelines themselves. The following diagram shows how different tools will interact with the Schema Registry:

Figure 8.10 Schema Registry with an API layer on top of it can be used both by pipelines that are internal to the data platform and by external teams and tools
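To show what such an API layer could look like, here is a hedged sketch of a thin HTTP front end over the registry database, using only the Python standard library. The endpoint path and payload shape are assumptions for illustration, not a specified API:

```python
# Sketch of a thin HTTP API in front of the registry database.
# The endpoint path /schemas/<source>/latest is an assumption.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# A dict stands in for the registry database (e.g. a key-value store).
SCHEMAS = {"orders_db": [{"columns": ["order_id"]}]}

class RegistryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET /schemas/<source>/latest -> newest schema version as JSON
        parts = self.path.strip("/").split("/")
        if len(parts) == 3 and parts[0] == "schemas" and parts[2] == "latest":
            body = json.dumps(SCHEMAS[parts[1]][-1]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep request logging quiet

server = HTTPServer(("127.0.0.1", 0), RegistryHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/schemas/orders_db/latest"
latest = json.loads(urlopen(url).read())
server.shutdown()
```

The value of the API layer is exactly this indirection: pipelines and external teams depend on a stable HTTP contract, so you can swap the database underneath without changing any clients.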

When it comes to the actual database, the same key-value services we mentioned in Chapter 7 will work for the registry database as well: Azure Cosmos DB, GCP Datastore, or AWS DynamoDB. In fact, when implementing the Schema Registry in the past, we have often used the same database for the Pipeline Metadata and the Schema Registry. Sometimes you may need to use separate instances of Cosmos DB, Datastore, or DynamoDB for the pipeline metadata and the schemas. For example, in a hybrid scenario where some data source schemas are managed by the data platform and some are managed by the application teams, you may want the application teams to have permissions to access only the schema data, not the pipeline configuration data. Fortunately, the cloud makes it easy to create multiple instances of these data stores and configure granular access to them.
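One common way to lay out schema versions in such a key-value store is a composite key: partition by source, sort by version. The sketch below uses a plain dict to stand in for Cosmos DB, Datastore, or DynamoDB; the `SCHEMA#` prefix and zero-padded version keys mirror common DynamoDB conventions but are assumptions, not a prescribed design:

```python
# Sketch of one possible key layout for schemas in a key-value store.
# A plain dict stands in for Cosmos DB / Datastore / DynamoDB.

store = {}  # (partition_key, sort_key) -> item

def put_schema(source, version, schema):
    # Partition by source; zero-pad the version so a range scan
    # over the sort key returns versions in order.
    store[(f"SCHEMA#{source}", f"v{version:06d}")] = schema

def get_latest(source):
    # In a real service this would be a reverse range query on the
    # partition; here we sort the matching keys ourselves.
    keys = sorted(k for k in store if k[0] == f"SCHEMA#{source}")
    return store[keys[-1]]

put_schema("orders_db", 1, {"columns": ["order_id"]})
put_schema("orders_db", 2, {"columns": ["order_id", "amount"]})
```

Keeping schemas under their own key prefix (or in a separate table or instance) also makes the access-control split described above straightforward: application teams get read/write permissions on the schema keys only, while pipeline configuration stays private to the platform team.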
