6 Analytics

 

In this chapter:

  • Separating development and production environments
  • Creating an analytics workflow
  • Supporting self-serve data movement

This chapter focuses on analytics, one of the major workloads a data platform needs to support. We briefly touched on the topic in chapter 3, when we took a query our data scientist used to run and implemented DevOps for it, including tracking it in source control and deploying it using an Azure Pipeline. We’ll expand on the topic in this chapter. Figure 6.1 highlights our current focus.

Figure 6.1 Analytics is one of the major workloads our data platform needs to support. This includes all reporting, insight generation, and statistical analysis data scientists run on the platform.

Our approach here will be quite different from the previous chapter's: there, we focused on how to implement various aspects of data processing, but when it comes to analytics, we should empower data scientists to do their own work. That means providing the infrastructure that allows them to self-serve their needs and putting good guardrails in place to keep things running smoothly.

Instead of focusing on the actual analytics, we will focus on how best to design our system to require minimal engineering involvement in data movement and data processing, while maintaining a high quality bar.

6.1      Structuring storage

6.1.1   Providing development data

6.1.2   Replicating production data

6.1.3   Providing read-only access to the production data

6.1.4   Storage structure recap

6.2      Analytics workflow

6.2.1   Prototyping

6.2.2   Development and user acceptance testing

6.2.3   Production

6.2.4   Analytics workflow recap

6.3      Self-serve data movement

6.3.1   Support model

6.3.2   Data contracts

6.3.3   Pipeline validation

6.3.4   Postmortems

6.3.5   Self-serve data movement recap

6.4      Summary
