6 Analytics

This chapter covers

Separating development and production environments
Creating an analytics workflow
Supporting self-serve data movement

This chapter focuses on analytics, one of the major workloads a data platform needs to support. We briefly touched on the topic in chapter 3, when we took a query from our data scientist, Mary, and implemented DevOps for it, including tracking it in source control and deploying it using an Azure Pipeline. We’ll expand on that work in this chapter. Figure 6.1 highlights our current focus.

Figure 6.1 Analytics is one of the major workloads our data platforms need to support. This includes all reporting, insight generation, and statistical analysis that data scientists might want to run.

Our approach here will be quite different from the one we took in the previous chapter. There, we focused on how we would implement various aspects of data processing. When it comes to analytics, though, we should empower data scientists to do their work. That means providing an infrastructure that allows them to self-serve their needs and putting good guardrails in place to keep things running smoothly.

6.1 Structuring storage

6.1.1 Providing development data

6.1.2 Replicating production data

6.1.3 Providing read-only access to the production data

6.1.4 Storage structure recap

6.2 Analytics workflow

6.2.1 Prototyping

6.2.2 Development and user acceptance testing

6.2.3 Production

6.2.4 Analytics workflow recap

6.3 Self-serve data movement

6.3.1 Support model

6.3.2 Data contracts

6.3.3 Pipeline validation

6.3.4 Postmortems