4 Azure Data Lake Storage

 

This chapter covers

  • Setting up a Data Lake store
  • Configuring file access in Data Lake Storage
  • Understanding and planning for data drift

In the last chapter, you learned how to work with a fundamental Azure service, the Storage account. Storage accounts provide nearly unlimited storage for many Azure services, with high throughput and high redundancy. Storage accounts also host other storage services, such as file shares and queues.

In this chapter, you’ll learn about another storage service, Azure Data Lake Storage (ADLS). You’ll create a Data Lake store and learn how to structure your data lake to increase maintainability and security. You’ll learn how this service supports other Azure services through Azure Active Directory authentication. This will be the central service around which you construct the analytics system.

ADLS resembles a local file system, with folders and files. Azure Active Directory (AAD) controls access to folders and files through assignable read, write, and execute permissions. ADLS provides the primary storage backbone for the master data set, which is a source of data for batch layer processing. ADLS also stores batch analysis artifacts, including the report files that make up the output of the serving layer (see figure 4.1).

Figure 4.1 Lambda architecture with Azure PaaS services
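
To make the read/write/execute idea concrete before the detailed sections, here is a minimal sketch of how file-system-style permissions might combine across a folder hierarchy. It is a toy in-memory model, not the ADLS API: the `ACLS` table, the paths, the `analyst` principal, and both function names are hypothetical, and the rule it encodes (execute on every parent folder plus read on the file itself) is assumed from POSIX-style semantics.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory ACL table: path -> {principal: "rwx"-style string}.
# Paths and principal names are invented for illustration only.
ACLS = {
    "/": {"analyst": "--x"},
    "/sales": {"analyst": "r-x"},
    "/sales/2019.csv": {"analyst": "r--"},
    "/hr": {"analyst": "---"},
    "/hr/salaries.csv": {"analyst": "r--"},
}


def has_permission(path: str, principal: str, flag: str) -> bool:
    """Return True if the principal's entry on this path contains the flag."""
    return flag in ACLS.get(path, {}).get(principal, "---")


def can_read_file(path: str, principal: str) -> bool:
    """Assumed rule: execute on every ancestor folder, read on the file."""
    ancestors = [str(p) for p in PurePosixPath(path).parents]
    can_traverse = all(has_permission(a, principal, "x") for a in ancestors)
    return can_traverse and has_permission(path, principal, "r")
```

In this model, `can_read_file("/sales/2019.csv", "analyst")` succeeds, while `can_read_file("/hr/salaries.csv", "analyst")` fails even though the file itself grants read, because the `/hr` folder withholds execute. That interaction between folder and file permissions is one reason the folder hierarchy deserves planning.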

4.1 Create an Azure Data Lake store

4.1.1 Using Azure Portal

4.1.2 Using Azure PowerShell

4.2 Data Lake store access

4.2.1 Access schemes

4.2.2 Configuring access

4.2.3 Hierarchy structure in the Data Lake store

4.3 Storage folder structure and data drift

4.3.1 Hierarchy structure revisited

4.3.2 Data drift

4.4 Copy tools for Data Lake stores

Summary
