Chapter 3. Collecting data

This chapter covers

Collecting inherently uncertain data
Handling data collection at scale
Querying aggregates of uncertain data
Avoiding updating data after it’s been written to a database

This chapter begins our journey through the components, or phases, of a machine learning system (figure 3.1). Until there’s data in your machine learning system, you can’t do anything, so we’ll begin with collecting data. As you saw in chapter 1, the naive approach for getting data into a machine learning system can lead to all sorts of problems. This chapter will show you a much better way to collect data, one based on recording immutable facts. The approach in this chapter also assumes that the data being collected is intrinsically uncertain and effectively infinite.

Figure 3.1. Phases of machine learning

Many people don’t even mention data collection when they discuss building machine learning systems. At first glance, it doesn’t seem as exciting as learning models or making predictions. But collecting data is crucial and a lot harder than it looks. There are no easy shortcuts to building production-grade apps that can collect vast amounts of highly variable data in an environment of change. We need to bring the full power of reactive machine learning to bear on this problem to ensure that we have good, usable data that can be consumed by other components of our machine learning systems.
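To make the idea of recording immutable, uncertain facts a little more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation this chapter goes on to build: the names `Reading`, `FactLog`, and `weighted_mean`, the `confidence` field, and the sample values are all hypothetical. The sketch shows two things: a fact that can't be modified once created, and an aggregate query that takes each fact's uncertainty into account rather than treating every observation as equally true.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reading:
    """One immutable fact: what was sensed, when, and how sure we were.

    All field names are hypothetical, chosen only for this sketch.
    """
    sensor_id: str      # which source produced the observation
    timestamp: int      # seconds since the epoch
    value: float        # the sensed quantity
    confidence: float   # 0.0 (a guess) to 1.0 (certain)


class FactLog:
    """Append-only store: facts are added, never updated or deleted."""

    def __init__(self) -> None:
        self._facts: List[Reading] = []

    def record(self, fact: Reading) -> None:
        self._facts.append(fact)

    def weighted_mean(self, sensor_id: str) -> float:
        """Aggregate all facts for one sensor, weighting by confidence."""
        facts = [f for f in self._facts if f.sensor_id == sensor_id]
        total_weight = sum(f.confidence for f in facts)
        if total_weight == 0:
            raise ValueError(f"no usable facts for {sensor_id}")
        return sum(f.value * f.confidence for f in facts) / total_weight


log = FactLog()
log.record(Reading("s1", 1000, 10.0, 0.5))
# A later, more confident observation is a new fact, not an overwrite.
log.record(Reading("s1", 1060, 20.0, 1.0))
estimate = log.weighted_mean("s1")
```

Because nothing is ever overwritten, the earlier, less certain reading remains available to any later query; the aggregate simply gives it less weight. A production system would persist these facts in a database rather than an in-memory list, which is an assumption this sketch leaves out.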

3.1. Sensing uncertain data

3.2. Collecting data at scale

3.3. Persisting data

3.4. Applications

3.5. Reactivities

Summary