Chapter 3. Collecting data
This chapter covers
- Collecting inherently uncertain data
- Handling data collection at scale
- Querying aggregates of uncertain data
- Avoiding updates to data after it’s been written to a database
This chapter begins our journey through the components, or phases, of a machine learning system (figure 3.1). Until there’s data in your machine learning system, you can’t do anything, so we’ll begin with collecting data. As you saw in chapter 1, the naive approach for getting data into a machine learning system can lead to all sorts of problems. This chapter will show you a much better way to collect data, one based on recording immutable facts. The approach in this chapter also assumes that the data being collected is intrinsically uncertain and effectively infinite.
Many people don’t even mention data collection when they discuss building machine learning systems. At first glance, it doesn’t seem as exciting as learning models or making predictions. But collecting data is crucial and a lot harder than it looks. There are no shortcuts to building production-grade apps that can collect vast amounts of highly variable data in a constantly changing environment. We need to bring the full power of reactive machine learning to bear on this problem to ensure that we have good, usable data that other components of our machine learning systems can consume.
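To make the idea of recording immutable facts a little more concrete before going further, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the approach this chapter goes on to build: the names `Reading`, `FactLog`, and `weighted_mean`, and the use of a `confidence` field to represent uncertainty, are all hypothetical. The sketch shows three ideas from this chapter's overview: each observation carries its own uncertainty, facts are appended and never updated, and aggregates are computed by querying the recorded facts.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Reading:
    """One immutable fact: what was observed, how sure we are, and when.

    All field names here are hypothetical, chosen for illustration only.
    """
    source_id: str
    value: float
    confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (full trust)
    timestamp: int     # seconds since the epoch


class FactLog:
    """Append-only store: facts are recorded, never updated or deleted."""

    def __init__(self) -> None:
        self._facts: List[Reading] = []

    def record(self, fact: Reading) -> None:
        # A later, better observation is stored as a new fact;
        # earlier facts are left untouched.
        self._facts.append(fact)

    def facts(self) -> Tuple[Reading, ...]:
        # Return a tuple so callers cannot modify the log through it.
        return tuple(self._facts)


def weighted_mean(facts: Iterable[Reading]) -> float:
    """Aggregate uncertain readings, weighting each by its confidence."""
    facts = list(facts)
    total_weight = sum(f.confidence for f in facts)
    if total_weight == 0:
        raise ValueError("no facts with nonzero confidence to aggregate")
    return sum(f.value * f.confidence for f in facts) / total_weight


# Usage: two observations of the same quantity with different certainty.
log = FactLog()
log.record(Reading("sensor-a", 10.0, 0.5, 1_000))
log.record(Reading("sensor-a", 20.0, 1.0, 1_060))
estimate = weighted_mean(log.facts())
```

Because `Reading` is declared with `frozen=True`, assigning to a field of an existing reading raises an error, so the only way to revise what the system knows is to record another fact. Weighting by confidence is only one possible way to aggregate uncertain data, chosen here for brevity.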