3 Getting bigger and leveraging the Big 3 — Google, Amazon and Microsoft

 

This chapter covers:

  • Designing a flexible and scalable 6-layer data platform architecture that supports both batch and streaming data, meets the needs of different data consumers, and has the right foundational components for easier ongoing management.
  • Choosing the best tools and services to implement a modern cloud data platform on AWS, GCP, or Azure.

Chapter 2 covered setting up a simple data platform made up of a data lake and a data warehouse in the cloud, with simple batch pipelines to ingest data. It also laid out the pros and cons of using a data lake, a data warehouse, or a combination of the two to produce the best analysis outcomes.

In this chapter, we’ll build on the data platform architecture concepts introduced in Chapters 1 and 2 and layer on top of them some of the critical and more advanced functionality needed for most data platforms today. Without this added layer of sophistication, your data platform would work, but it wouldn’t scale easily, nor would it meet the growing data velocity challenges discussed in Chapter 1. It would also be limited in the types of data consumers (the people and systems that consume data from the platform) it supports, as they too are growing in both number and variety.

We will take a deeper dive into a more complex cloud data platform architecture:

3.1 Cloud data platform layered architecture

3.1.1 Data ingestion layer

3.1.2 Fast and slow storage

3.1.3 Processing layer

3.1.4 Technical Metadata layer

3.1.5 The Serving Layer and data consumers

3.1.6 Orchestration and ETL overlay layers

3.2 The importance of layers in a data platform architecture

3.3 Mapping cloud data platform layers to specific tools

3.3.1 AWS

3.3.2 Google Cloud Platform

3.3.3 Azure

3.4 Open Source and commercial alternatives

3.4.1 Batch data ingestion

3.4.2 Streaming data ingestion and real-time analytics

3.4.3 Orchestration layer

3.5 Summary
