chapter seven

7 Designing the federation layer

 

This chapter covers

  • Evaluating requirements for data federation
  • Designing the federation layer components
  • Comparing Dremio and Trino for federated querying
  • Weighing self-managed and cloud-managed federation options
  • Selecting a federation platform based on use cases

As your Apache Iceberg lakehouse takes shape, it is important to recognize that not all data will reside within Iceberg tables. Despite best efforts to centralize and standardize, some datasets will remain scattered: locked in third-party systems, legacy databases, and SaaS applications, or simply not worth the effort of extracting, transforming, and loading them into your lakehouse. These realities make it essential to extend your architecture with a federation layer.

The federation layer acts as both a bridge and a harmonizer. It enables your analytics platform to access data across multiple systems without physically consolidating it. At the same time, it can introduce a semantic layer that standardizes business logic, keeping metrics and datasets consistent regardless of their origin. Whether your analysts query data through notebooks, BI dashboards, or custom applications, the federation layer can provide a unified and governed interface to the underlying data ecosystem, as illustrated in figure 7.1.

7.1 What data federation is and why it matters

7.1.1 Common use cases and challenges driving federation needs

7.1.2 How federation aligns with agility and accessibility

7.2 Key requirements for federation

7.2.1 Supporting diverse data sources without duplication

7.2.2 Ensuring consistent semantics and business logic

7.2.3 Providing seamless connectivity for analytics tools

7.2.4 Introducing Dremio and Trino

7.3 Dremio

7.3.1 Dremio architecture

7.3.2 Dremio’s connector ecosystem and Iceberg-centric focus

7.3.3 Dremio’s performance enhancements

7.4 Trino

7.4.1 Modular architecture for wide-source support

7.4.2 Flexibility and configurability for complex environments

7.4.3 Community-led evolution and vendor extensions

7.4.4 Semantic layer considerations in Trino

7.5 Deployment models

7.5.1 Deployment with Dremio

7.5.2 Deployment with Trino

7.6 Federation platform decision scenarios

7.6.1 Fragmented multi-source environment: Trino for connector breadth

7.6.2 Building a native Iceberg lakehouse: Dremio for Iceberg-native features

7.6.3 Empowering business users with UI and governed datasets: Dremio

7.6.4 Lightweight querying of Hudi datasets: Trino via AWS Athena

7.6.5 On-prem Cloudera modernization: Trino replacing Impala for performance

7.6.6 Hybrid cloud Iceberg strategy: Dremio bridging on-prem and ADLS