Appendix. Capturing data on the web

As you’ve learned in this book, intelligent applications are those that can change their behavior based on information. It follows, then, that we must have a mechanism for capturing and accessing that data. Because we’re talking about web-scale processing, we need a system designed with the following in mind:

Volume— Our system should be capable of dealing with web-scale data.
Scalability— Our system should be reconfigurable as load changes, scaling up or down to match.
Durability— Outages or network blips shouldn’t affect the eventually consistent state of the data.
Latency— We shouldn’t expect to wait long periods of time between data being generated and data being processed.
Flexibility— Access to the data should be flexible, allowing multiple services to read from and write to the data platform, each at a different state of progress.
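
To make the flexibility requirement concrete, here is a minimal sketch in Python of an append-only log in which each reading service keeps its own position. The class name AppendOnlyLog, the reader names, and the event strings are hypothetical and chosen only for illustration. The log lives in memory, so it says nothing about volume, scalability, or durability, and it isn't the API of any real system.

```python
class AppendOnlyLog:
    """Toy in-memory log: records are only ever appended, never modified."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # reader name -> index of the next record to read

    def append(self, record):
        """Add a record to the end of the log and return its position."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, reader, max_records=10):
        """Return the next batch for this reader and advance only its offset."""
        start = self._offsets.get(reader, 0)
        batch = self._records[start:start + max_records]
        self._offsets[reader] = start + len(batch)
        return batch


log = AppendOnlyLog()
for event in ["click:ad-1", "impression:ad-2", "click:ad-3"]:
    log.append(event)

# Two hypothetical services consume the same events at their own pace.
billing_batch = log.read("billing", max_records=2)
reporting_batch = log.read("reporting", max_records=10)
```

Because each reader's offset is tracked separately, a slow service doesn't hold back a fast one, and a service that starts late can still read from the beginning of the log.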

In the internet industry, these are typically logging concerns: when an event occurs, it has traditionally been written down in a log, or log file. In the coming sections, we’ll discuss the implications of the log file in detail before presenting an alternative, which we’ll assess against the preceding points. To set the scene, and to provide an illustrative example to refer to throughout the remainder of this appendix, we introduce a use case from the world of online advertising.

A motivating example: showing ads online

Data collection: a naïve approach

Managing data collection at scale

Introducing Kafka

Evaluating Kafka: data collection at scale

Kafka design patterns