In the last chapter, you used Azure Stream Analytics as a source of raw data by running a passthrough query. A passthrough query takes incoming data and passes it unchanged to the output, in this case files in Azure Data Lake Storage (ADLS). Figure 7.1 shows this use of Stream Analytics in parallel with the serving layer.
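As a reminder, a passthrough query in the Stream Analytics query language simply selects every field from the input and routes it to the output. The sketch below uses hypothetical input and output aliases, [hubinput] and [lakeoutput]; substitute the aliases configured on your own job.

-- Passthrough query: select every field from the input stream and
-- write it unchanged to the output.
-- [hubinput] and [lakeoutput] are hypothetical alias names, not ones
-- defined earlier in the book.
SELECT *
INTO [lakeoutput]
FROM [hubinput]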
This is the latest example of prep work for batch processing, which includes loading files into storage and saving groups of messages into files. The Azure Storage, Data Lake Storage, and Event Hubs services lay the foundation for building a batch processing analytics system in Azure. With files in the ADLS store, you're ready to start batch processing.
In this chapter, you’ll learn how to use Azure Data Lake Analytics (ADLA) to run analysis over data stored in semi-structured files. ADLA powers the batch processing pillar of the Lambda architecture. Figure 7.2 shows ADLA as the focus of the batch layer. ADLA uses Azure’s unbounded fast storage and readily available processing nodes to make analyzing file-based data sets as easy as analyzing relational database data sets.
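To give you a feel for what that looks like, here is a minimal U-SQL sketch of the kind of script you'll build in this chapter. It reads a set of delimited files from the Data Lake store, aggregates the rows with familiar SQL syntax, and writes a summary file back to the store. The folder paths, schema, and column names are hypothetical placeholders, not values used elsewhere in the book.

// Read all CSV files in a (hypothetical) staging folder, skipping the
// header row in each file. The schema is an assumed example.
@sensors =
    EXTRACT SensorId string,
            ReadingTime DateTime,
            Temperature double
    FROM "/Staging/sensors/{*}.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Project the reading date so it can be used as a grouping column.
@readings =
    SELECT SensorId,
           ReadingTime.Date AS ReadingDate,
           Temperature
    FROM @sensors;

// Aggregate with standard SQL syntax, just as you would over a
// relational table.
@daily =
    SELECT SensorId,
           ReadingDate,
           AVG(Temperature) AS AvgTemperature
    FROM @readings
    GROUP BY SensorId, ReadingDate;

// Write the summary back to a (hypothetical) curated folder.
OUTPUT @daily
TO "/Curated/daily_temperature.csv"
USING Outputters.Csv(outputHeader: true);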