1 Data science at scale


This chapter covers

  • Understanding the steps involved in developing data-centric applications and how Kubeflow helps manage data-centric workflows
  • Understanding containers and how Kubernetes helps manage multiple running containers (also known as container deployments)

The rise of internet companies over the last two decades has produced a glut of data. Large-scale data collection has been enabled by advances in high-speed networking and data-storage technologies, and this has led to the crystallization of a new field called data science. Although most aspects of data science have existed for decades, and sophisticated data-analysis techniques have long been in use, the pervasiveness and integration of these techniques today is unprecedented.

Given the breadth and depth of the skills required to use data effectively, it is unlikely, if not impossible, for any individual to have expertise in more than a few of the core skills. Tools that abstract away infrastructure concerns, letting one focus on understanding the problem and the data and on building effective statistical models, act as force multipliers for data scientists.

1.1 Why are Kubeflow and Kubernetes needed?

1.2 Kubeflow and Open Data Hub

1.2.1 Operating Systems, Virtual Machines and Containers

1.2.2 Kubernetes and OKD

1.3 Summary