The past 20 years have seen a surge in interest in the development of experimental methods used to measure and improve engineered systems, such as web products, automated trading systems, and software infrastructure. Experimental methods have become more automated and more efficient, and they have scaled up to large systems like search engines and social media sites, where they generate continuous, automated performance improvement of live production systems.
Using these experimental methods, engineers measure the business impact of the changes they make to their systems and determine the optimal settings under which to run them. We call this process experimental optimization.
This book teaches several experimental optimization methods used by engineers working in trading and technology. We’ll discuss systems built by three specific types of engineers:
- Machine learning engineers often work on web products like search engines, recommender systems, and ad placement systems.
- Quants build automated trading systems.
- Software engineers build infrastructure and tooling such as web servers, compilers, and event processing systems.
These engineers follow a common process, or workflow, that is an endless loop of steady system improvement. Figure 1.1 shows this common workflow.
Figure 1.1 Common engineering workflow. (1) A new idea is first implemented as a code change to the system. (2) Typically, some offline evaluation is performed that rejects ideas that are expected to negatively impact business metrics. (3) The change is pushed into the production system, and business metrics are measured there, online. Accepted changes become permanent parts of the system. The whole workflow repeats, creating reliable, continuous improvement of the system.

The common workflow creates progressive improvement of an engineered system. An individual or a team generates ideas that they expect will improve the system, and they pass each idea through the workflow. Good ideas are accepted into the system, and bad ideas are rejected (a toy code sketch of this loop follows the list):
- Implement change—First, an engineer implements an idea as a code change, an update to the system’s software. In this stage, the code is subjected to typical software engineering quality controls, like code review and unit testing. If it passes all tests, it moves on to the next stage.
- Evaluate offline—The business impact of the code change is evaluated offline, away from the production system. This evaluation typically uses data previously logged by the production system to produce rough estimates of business metrics such as revenue or the expected number of clicks on an advertisement. If these estimates show that applying this code change to the production system would worsen business metrics, then the code change is rejected. Otherwise, it is passed to the final stage.
- Measure online—The change is pushed into production, where its impact on business metrics is measured. The code change might require some configuration—the setting of numerical parameters or Boolean flags. If so, the engineer will measure business metrics for multiple configurations to determine which is best. If no improvements to business metrics can be made by applying (and configuring) this code change, then the code change is rejected. Otherwise, the change is made permanent and the system improves.
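To make the shape of this loop concrete, here is a toy sketch in Python. Everything in it is a stand-in invented for illustration: the offline_estimate and measure_online functions and the simulated impacts do not come from any real system.

```python
import random

# Toy, self-contained sketch of the common workflow as a loop over ideas.
# In a real system, the offline estimate would come from logged data and the
# online measurement from an experiment on the live production system.

def offline_estimate(idea):
    return idea["true_impact"] + random.gauss(0, 0.5)   # cheap but rough

def measure_online(idea):
    return idea["true_impact"] + random.gauss(0, 0.1)   # slow and costly, but accurate

ideas = [{"name": f"idea-{i}", "true_impact": random.gauss(0, 1)} for i in range(10)]

accepted = []
for idea in ideas:                      # stage 1 (implement, review, test) assumed done
    if offline_estimate(idea) < 0:      # stage 2: cheaply reject ideas that look harmful
        continue
    if measure_online(idea) > 0:        # stage 3: keep only measured improvements
        accepted.append(idea["name"])
print("accepted ideas:", accepted)
```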
This book deals with the final stage, “measure online.” In this stage, you run an experiment on the live production system. Experimentation is valuable because it produces a measurement from the real system, which is information you couldn’t get any other way. But experimentation on a live system takes time. Some experiments take days or weeks to run. And it is not without risk. When you run an experiment, you may lose money, alienate users, or generate bad press or social media chatter as users notice and complain about the changes you’re making to your system. Therefore, you need to take measurements as quickly and precisely as possible to minimize the ill effects—call them costs for brevity—of ideas that don’t work and to take maximal advantage of ones that do.
To extract the most value from a new bit of code, you need to configure it optimally. You could liken the process of finding the best configuration to tuning an old AM or FM radio or tuning a guitar string. You typically turn a knob up and down and listen to see whether you’re getting good results. Set the knob too high or too low and your radio will be noisy, or your guitar will be sharp or flat. So it is with code configuration parameters (often referred to as knobs in code your author has read). You want them set to just the right values to give maximal business impact—whether that’s revenue or clicks or some other metric. Note that the need to run costly experiments is what defines experimental optimization methods as a subset of optimization methods more generally.
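As a minimal illustration of knob tuning, the sketch below sweeps one configuration parameter over a few candidate values and keeps the setting with the best measured metric. The measure_business_metric function is a made-up stand-in for an online measurement of revenue, clicks, or whatever metric you care about.

```python
import random

def measure_business_metric(knob):
    # Stand-in for an online measurement of a business metric at this setting.
    # Here it is just a noisy function peaked near knob = 1.0, for illustration.
    return -(knob - 1.0) ** 2 + random.gauss(0, 0.05)

candidate_values = [0.25, 0.5, 1.0, 2.0, 4.0]      # settings of the "knob" to try
results = {v: measure_business_metric(v) for v in candidate_values}
best_value = max(results, key=results.get)         # the "just right" setting
print(f"best knob setting: {best_value}")
```

Each candidate setting costs a real experiment, which is why later chapters spend so much effort reducing the number of settings you need to measure.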
In this chapter, we’ll discuss engineering workflows for each of the engineer types listed earlier—machine learning engineer (MLE), quant, and software engineer (SWE). We’ll see what kinds of systems they work on, the business metrics they measure, and how each stage of the generic workflow is implemented.
In your organization, you might hear of alternative ways of evaluating changes to a system. Common suggestions are domain knowledge, model-based estimates, and simulation. We’ll discuss why these tools, while valuable, can’t substitute for an experimental measurement.

While the engineers listed earlier may work in different domains, their overall workflows are similar. Their workflows can be seen as specific cases of the common engineering workflow we described in figure 1.1: implement change, evaluate offline, measure online. Let’s look in detail at an example workflow for an MLE, for a quant, and for an SWE.
Figure 1.2 Example workflow for a machine learning engineer building a news-based website. The site contains an ML component that predicts clicks on news articles. (1) The MLE fits a new predictor. (2) An estimate of ad revenue from the new predictor is made using logs of user clicks and ad rates. (3) The new predictor is deployed to production and actual ad revenue is measured. If it improves ad revenue, then it is accepted into the system.

The key machine learning (ML) component of the website is a predictor model that predicts which news articles a user will click on. The predictor might take as input many features, such as information about the user’s demographics, the user’s previous activity on the website, and information about the news article’s title or its content. The predictor’s output will be an estimate of the probability that a specific user will click on a given news article. The website could use those predictions to rank and sort news articles on a headlines-summary page, hoping to put more appealing news higher up on the page.
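In its simplest form, such a click predictor might look like the following sketch. The features, the tiny data set, and the choice of scikit-learn’s logistic regression are all assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy logged data: each row holds (user, article) features; label = 1 if clicked.
# The features [user_age, past_clicks, headline_length] are invented for illustration.
X = np.array([[25, 3, 40], [31, 0, 55], [22, 7, 38], [45, 1, 60], [36, 5, 42]])
y = np.array([1, 0, 1, 0, 1])

predictor = LogisticRegression().fit(X, y)

# Predicted click probability for a new (user, article) pair. The site would
# compute this for every candidate article and sort the headlines page by it.
p_click = predictor.predict_proba([[29, 4, 45]])[0, 1]
print(f"predicted click probability: {p_click:.2f}")
```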
Figure 1.2 depicts the workflow for this system. When the MLE comes up with an idea to improve the predictor—a new feature or a new model type—the idea is subjected to the workflow:
- Implement change—The MLE fits the new predictor to logged data. If it produces better predictions on the logged data than the previous predictor, it passes to the next stage.
- Evaluate offline—The business goal is to increase revenue from ads that run on the website, not simply to improve click predictions. Translating improved predictions into improved revenue is not straightforward, but methods exist that give useful estimates for some systems. If the estimates do not look very bad, the predictor will pass on to the next stage.
- Measure online—The MLE deploys the predictor to production, and real users see their headlines ranked with it. The MLE measures the ad revenue and compares it to the ad revenue produced by the old predictor. If the new predictor improves ad revenue, then it is accepted into the system.
A news-based website may have many other components besides a click predictor. Each of those components would be exposed to the same workflow as the predictor, ensuring that the system steadily produces more ad revenue.
MLEs work on many kinds of systems. Sorting news headlines by click probability is an example of a broader class of system called a recommender system. Recommender systems are used to rank videos, music, social media posts, consumer goods, and more. Search engines are a similar ML system, in that they may rank search results specifically for the user. Targeted advertising, which chooses ads specifically for the user, is another type of MLE system. Now let’s turn to finance and see how quants follow the same workflow pattern.
A quant’s workflow is very similar to the MLE’s workflow. Only the details change. There’s a different prediction to be made, for example. See figure 1.3.
Figure 1.3 Example workflow for a quant designing an automated trading strategy. The strategy contains a price-change predictor. (1) The quant produces a new predictor. (2) Profit and risk estimates come from a simulation using historical market data. (3) Live trading measures the true profit and risk. If the new predictor increases profit and/or reduces risk, then it is accepted into the system.

This quant is building an automated trading strategy. It is a piece of software that issues BUY and SELL orders to an exchange hoping to, as they say, buy low and sell high. A key component is a model that predicts change in the price of the financial instrument (e.g., a stock) being traded. If the price is predicted to increase, it’s a good time to issue a BUY order. Similarly, if the price is predicted to decrease, it’s a good time to SELL. The business metric for this system is profit. But it’s also risk. Quants want both higher profit and lower risk. It is not uncommon (in practice, it’s the norm) to be concerned with more than one business metric when optimizing a system. Chapter 7, section 3 will discuss this important practical point in detail.
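In its simplest form, the way a prediction turns into orders might look like the sketch below. The function name and the threshold are hypothetical, and real strategies also account for position limits, trading costs, and risk.

```python
def decide_order(predicted_price_change, threshold=0.01):
    # Act only when the predicted move is large enough to be worth trading on.
    # The threshold is exactly the kind of "knob" a quant would tune experimentally.
    if predicted_price_change > threshold:
        return "BUY"
    if predicted_price_change < -threshold:
        return "SELL"
    return "HOLD"

print(decide_order(0.02))    # BUY
print(decide_order(-0.03))   # SELL
print(decide_order(0.005))   # HOLD
```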
- Implement change—The quant fits the new price-change predictor to historical market data and verifies that it produces better predictions than the previous predictor.
- Evaluate offline—Better price predictions do not guarantee higher profits (or lower risk). The full trading strategy—predictor, BUY/SELL orders, and so on—is run through a simulation (also called a backtest) on historical market data. The simulation generates predictions and mimics buying and selling to estimate profit and risk. Sufficient improvement in the strategy will allow the predictor to pass to the next stage.
- Measure online—The predictor is deployed to live trading, where orders are placed and money and stock shares change hands. Only live trading can tell the true profit and risk of the strategy. The change to the predictor will be reverted if it worsens the strategy’s profit or risk.
Quants typically work on one of two types of trading systems: principal or agency. A principal strategy trades directly for the profit of the operator (the quant, or the company employing the quant). An agency strategy trades on behalf of customers as a service, helping customers reduce their trading costs.
There are many variations to these two types of strategies. They may trade stocks, futures contracts, options, or many other financial products. Each product type typically has multiple exchanges around the world on which to trade.
Also, a key defining component of a strategy is its timescale. A principal strategy owns a stock (or other instrument) for some amount of time before selling it. That amount of time may be on the order of minutes, hours, days, or weeks, and sometimes even as long as months or as short as seconds. Each timescale requires a different predictor and a different understanding of risk.
The MLE and quant workflows are similar because their systems are similar. They typically consist of a predictive model fit on data and some decision-making code that determines how the prediction is used. A software engineer’s workflow is somewhat different and is the next topic.
SWEs work on a broad range of systems. In this text, we’ll define SWE problems as those that do not involve building models from data (thus differentiating them from MLEs and quants). SWEs build compilers, caching systems, web servers, trading system infrastructure (on which trading strategies run), and much more.
As an example, let’s consider the problem of improving the response time of a search engine with the goal of lowering the “bounce rate,” which is the probability that a user will navigate away from a website after seeing just one page. Figure 1.4 shows the SWE’s workflow.
Figure 1.4 Example workflow for a software engineer building a search engine server. The server queries, aggregates, and transforms relevant data before sending the user a response. (1) The SWE changes the transformation portion of the code. (2) They time the code offline, verifying that it takes less time than the old code to transform several test data sets. (3) Running in production, the SWE measures whether the use of this new code results in a lower bounce rate, the business-relevant metric. If so, the new code is accepted as a permanent part of the system.

This SWE has built a search engine. It is a web server that responds to a user’s request by querying internal sources for a data set, transforming that data set, and delivering a formatted response to the user. Users are very sensitive to the time it takes for a web server to respond. If it takes too long, a user may navigate away from the web page before the response is delivered.
While there are many ways to slow down a web server’s response (slow browser, slow network, cache misses, etc.), this SWE has a hypothesis that it’s the data transformation step that is too slow. To fix the problem, they subject their hypothesis to the workflow:
- Implement change—The SWE implements a code change that they expect to speed up the transformation step.
- Evaluate offline—This code is run and timed offline on many samples of the internal data sets that resulted from previous user requests (see the timing sketch after this list). If it proves to be faster, it passes to the next stage.
- Measure online—The code change is deployed to production where responses are served to real users. The SWE measures the bounce rate and compares it to the bounce rate before the code change. If the new code lowers the bounce rate, it is accepted as a permanent part of the system.
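The offline timing step might be as simple as the sketch below, which uses Python’s timeit module. The two transform functions are stand-ins invented for illustration; the real ones would operate on the search engine’s internal data sets.

```python
import timeit

def old_transform(records):
    # Stand-in for the existing transformation: repeated string concatenation.
    out = ""
    for r in records:
        out += r.upper() + ","
    return out

def new_transform(records):
    # Stand-in for the proposed, hopefully faster, transformation using join().
    return ",".join(r.upper() for r in records) + ","

test_data_sets = [[f"record-{i}" for i in range(n)] for n in (100, 1_000, 10_000)]

for data in test_data_sets:
    t_old = timeit.timeit(lambda: old_transform(data), number=50)
    t_new = timeit.timeit(lambda: new_transform(data), number=50)
    print(f"n={len(data):6d}  old={t_old:.4f}s  new={t_new:.4f}s  new_is_faster={t_new < t_old}")
```

Passing this offline check says only that the new code is faster on test data; whether the speedup actually lowers the bounce rate is what the online measurement decides.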
Engineering teams tend to generate many creative ideas for improving the system they work on. If these ideas are the raw material, the workflow is the factory that processes them—steadily and reliably—into system improvements.
Each pass through the workflow ends with an online measurement of business metrics. That measurement is taken via an experiment on a live production system.

The engineered systems encountered in trading and technology are complex. This complexity can make it difficult to measure the impact of changes made to them. Consider a website that sells a product. A useful business metric might be daily revenue, the total number of dollars paid to the company by customers each day. That number depends on the quality of the product, its competition, how many people know about the product, how many people have already bought it, whether people are more inclined to shop on a given day (e.g., is it a weekend? Is it Black Friday?), how easy it is to navigate and understand the website, and so on. Many, many factors affect daily revenue, and many of them are not under the control of the company.
If you were to make a change to this website and record a day’s revenue, how could you say whether the change improved that revenue? Would you have made more or less on the day you measured if you hadn’t made the change? More importantly, would you expect to make more or less in the future if you left the change in or took it out? These questions can be answered by running experiments.
Experimental methods ignore all the other factors that affect a business metric and tease out just the impact of the change you made to the system. Surprisingly, satisfyingly, experiments even account for the impact of the factors that are unknown to you, the engineer (chapter 2 discusses this in detail). It’s this ability to isolate the impact of your system change and ignore everything else that makes an experiment the right tool for the job of measuring business impact.
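The mechanism behind this isolation is randomization: each interaction (a user, a request, an order) is randomly assigned to the old or the new version of the system, so all of the other factors, known and unknown, average out across the two groups. The toy sketch below shows the idea with simulated per-user revenue; the numbers are invented, and chapter 2 covers how to design and analyze such an experiment properly.

```python
import random
import statistics

def daily_revenue(version):
    # Simulated per-user revenue: a noisy baseline (everything else affecting
    # revenue) plus a small lift that exists only in the new version.
    base = random.gauss(10, 2)
    lift = 0.5 if version == "new" else 0.0
    return base + lift

revenue = {"old": [], "new": []}
for _ in range(10_000):                          # one simulated user per iteration
    version = random.choice(["old", "new"])      # random assignment is the key step
    revenue[version].append(daily_revenue(version))

diff = statistics.mean(revenue["new"]) - statistics.mean(revenue["old"])
se = (statistics.variance(revenue["new"]) / len(revenue["new"])
      + statistics.variance(revenue["old"]) / len(revenue["old"])) ** 0.5
print(f"estimated lift: {diff:.3f} +/- {2 * se:.3f}")
```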
Experiments are indeed valuable, but that value comes at a cost. Experiments take time to run, and they risk degrading system performance (e.g., if the change the engineer just implemented makes things worse instead of better) or damaging the system outright (e.g., due to a bug in the new code). To get the most out of experimentation, we’ll try to minimize these costs. Chapter 2 presents the idea of experiment design, where we minimize the amount of time an experiment will take to run while still giving the results we need. The subsequent chapters on experimental methods, chapters 3 through 6, all discuss ways to reduce these costs further in specific situations. Chapters 3 and 5, which cover bandit algorithms, make the experiment design adaptive, so that while the experiment is running and collecting measurements, the design steadily improves.
Recall that some system changes require the measurement of business metrics for multiple configurations to discover which is best. This induces a high measurement cost. The methods of chapters 4 and 6—response surface methodology and Bayesian optimization, respectively—use statistical inference to make good guesses about which system configurations are most promising, thus reducing the total number of measurements needed to find the best configuration.
These methods have been used in industry for anywhere from 10 to 70 years (depending on the method) and are popular in the fields in which I work—quantitative trading and social media. What makes trading and technology so amenable to experimentation is that systems in these industries have many interactions with the world. Trading systems can send thousands or tens of thousands of orders per day. Websites may receive from thousands to billions (for the largest websites) of requests per day. Each interaction provides an opportunity to experiment.
Drawing on personal experience, discussions with colleagues, and interviews specifically for the preparation of this book, I have tried to limit the material to a set of methods proven to work well in practice. Along with explanations of methods and real-world examples, I’ve also collected practical problems and pitfalls.
All these experimental methods assume you know your business metric. Chapter 7 discusses how to define one and how there’s usually more than one to consider. It also looks more closely at how to interpret experiment results and how that may be complicated when there are multiple metrics and multiple decision-makers involved.
Finally, chapter 8 lists ways in which real-world data can deviate from the assumptions made in the development of the experimental methods and common sources of error in interpretation of results.
One practical problem worth addressing before even getting into the details of experimentation is the question of whether you should experiment at all. It takes time and effort to build the tools needed to design, measure, and analyze changes to your system. You should get something in return for all that work. The next section discusses some common arguments against experimentation and presents counterarguments.

Any SWE is likely familiar with the admonition, attributed to Donald Knuth, that “premature optimization is the root of all evil”—that is, rather than implement ideas that you expect will make your code run faster (or better in some other way) at the outset, first write simple code to solve the problem, devise a way to time the code, then test your ideas one at a time to see which ones actually speed things up. It’s too difficult to reason about everything that could affect speed—the whole code base, the computer architecture, the operating system, and so on—all at once, so you rely on a test.
Similar reasoning applies to improving business metrics. There are too many factors that could affect business metrics for a web product, including all the software engineering factors listed above, as well as data quality, model quality, changes in user sentiment, changes in browser technology, news of the day, and much more. This is the case for any engineered system: many factors affect business metrics, and they do so in complicated ways. Experimentation is necessary to accurately measure the impact on business metrics of a change to the system.
There are other tools available to assess the business-metric impact of a system change. Some examples are domain knowledge, prediction models, and simulation.
These tools are discussed in detail below. You’ll see that they have two things in common: (1) they are cheaper (less resource-intensive) to use, and (2) they are less accurate than an experimental result. These tools may be useful supplements to your decision-making, but they can’t replace experiments.
Domain knowledge is the specialized knowledge of a field, a market, or a business that people acquire through education and experience. You might think this kind of knowledge would make people good at predicting which new ideas will make a positive business impact. But for the past 10 years, I’ve given an informal survey to my quant coworkers. I’ve asked, “Of the ideas you’ve implemented and tested, how many have actually worked?” The answer every single time has been 1 in 10. And it’s always been said with a chuckle and an air of resignation. That survey isn’t exactly scientific, but similar stories come from elsewhere, too. Microsoft reports that only one-third of experiments improve metrics. Amazon reports a success rate below 50%. Netflix says only 10% (see http://mng.bz/Xao6). Even though the people generating the ideas had domain knowledge, most experiments failed to produce the expected results. There seem to be aspects of the world that keep most good ideas from working.
One aspect is complexity. Your system is likely made up of many components: hardware components like computers and network switches, software components (both in-house and third-party), and human elements—operators, suppliers, customers. These components interact with each other, with the physical environment, and with society at large. Computers interact via networks. Humans interact with each other online and in person. They also interact with your servers through a browser or an API.
The physical environment includes the temperature of a data center—which, when too high, adversely impacts computer performance or causes failure. It also includes the weather, which affects people’s behavior. When the weather’s bad, do people use your product more because they can’t engage in outdoor activities? Do their posts or comments reflect their mood, which is in turn affected by the weather? There is evidence (D. Hirshleifer, T. Shumway, “Good Day Sunshine: Stock Returns and the Weather,” at www.jstor.org/stable/3094570) that sunshine in the morning in New York City is correlated with increased stock returns on that day on the New York Stock Exchange. The proposed causal mechanism is that sunshine makes the traders more optimistic. No engineer—or anyone, for that matter—could be expected to anticipate effects like this just from experience or reasoning.
To put a finer point on it, if you have N components in your system, you have ~N² pair-wise interactions. In other words, if your system has many components, then it has a huge number of interactions. That’s too much for a person to consider when trying to guess the impact a system change will have on business metrics.
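For a concrete count, the number of distinct pairs among N components is N(N-1)/2, so a system with just 100 components has 4,950 pairwise interactions, before counting any three-way or higher-order interactions.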
Generally, we’ll ignore most of that complexity when reasoning about a system in order to make things more manageable. We’ll create a mental model or even a mathematical model. In either case, the model of how your system operates contains the information about the system that you deemed important enough to include. In some models, this information might be called the signal. You leave out irrelevant details, which you might call noise. There’s a third category of things that affect your system’s performance: the things you didn’t even consider, because you don’t know about them. The “unknown unknowns,” they’re sometimes called (perhaps Donald Rumsfeld said it best: https://papers.rumsfeld.com/about/page/authors-note). These things could affect experimental results by any amount, either positively or negatively. You won’t anticipate them or have intuition about them because they’re missing from your model.
It’s plausible that the “unknown unknowns” of your system might include its most valuable aspects. A Harvard Business Review article (http://mng.bz/yaAq) tells the story of a proposed change to Microsoft’s Bing search engine. A domain knowledge-based decision made the change a low priority for implementation, but when it was finally coded up and put into production, it had a tremendous positive impact on revenue (over $100 million per year). It was simply the case that no one could understand the system—the code, the design, the users, and so on—completely enough to predict the dramatic impact of that change. Not because they weren’t smart. Not because they weren’t knowledgeable. Just because Bing, the user base, and the world they interact with are collectively just too complex.
If your company is competitive and surviving, there’s a good chance your “unknown unknowns” overlap with your competitors’. (My reasoning for this claim is that if your competitor discovered something valuable enough, it would either find its way into your product, too, or your company would be competed away.) If that’s the case, then to do something novel—to find value where your competitors haven’t—you’ll need to make changes to your system that you can’t evaluate with your existing domain knowledge. You’ll need to run experiments instead.
Domain knowledge is valuable. It will help you generate ideas and prioritize them—to make good bets. But domain knowledge won’t tell you outcomes. To understand impact on business metrics, you need to take experimental measurements. In addition, I posit that the most valuable changes you make to your system may come as surprises, creating impact unpredicted by domain knowledge.
It is common practice among MLEs to include a prediction model (e.g., a classifier) as a component in a system. It is not an uncommon experience to improve a model’s fit-quality metric (e.g., cross-entropy) and yet not see the business metric improve when the model is deployed.
Let’s say you build a model that predicts whether a user will click on news articles about sports. You gather a data set from production logs. It contains examples of sports articles that were presented to a user along with a record of which articles the user clicked on. Your model analyzes each article’s headline and predicts clicks very well. When you’re done building your model, you test it on out-of-sample data—data that wasn’t used in the fitting process—just to be sure you didn’t overfit. The model works great.
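The out-of-sample check described above might look like the following sketch, which uses synthetic stand-in data and scikit-learn; the feature construction and the specific fit metric (log loss, i.e., cross-entropy) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for logged data: X holds article/user features, y holds clicks.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# A good held-out score guards against overfitting, but it is still computed on data
# logged from the old system, so it says nothing about how users will respond once
# the model itself changes which articles they see.
print("out-of-sample log loss:", log_loss(y_test, model.predict_proba(X_test)[:, 1]))
```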
Next you put your model into production like this: Every time a user loads the sports news page, you sort the articles by your model’s prediction, hoping to show the articles the user is more interested in nearer to the top of the list. You find that the user isn’t more likely to click on the articles near the top. In fact, your model no longer seems to predict clicks very well. The model wasn’t overfit. You checked for that. It’s something different. The data used to fit your model was missing counterfactuals—events that happen in your system after you deploy a change but that didn’t happen before deployment.
The historical data you used to fit the model was generated by the system without your model in it. The articles were sorted some other way (perhaps sorted by date, or maybe using a different click-prediction model). When you fit your model, you were teaching it how users responded to that old system, the one with the old sorting method. Users responded differently to the new sorting method. It is difficult, if not impossible, to predict exactly how users will respond to the deployment of a new model.
The same experience might be had by a quant. They could build a new price-change prediction model using a regression, find that it has a higher R² (a common measure of the quality of a fit) than their old model, and see that it works well out-of-sample, yet still, when deployed, the profit of the strategy does not improve. The market is made up of traders—some algorithmic, some human—and they will respond differently to the new model’s presence in the market than they did to the old model’s. In this case, during fitting, the quant taught the new model about the old market, the one in which the new model was not a participant.
This is such a common experience that most quants and MLEs will (eventually) be familiar with it. The Facebook ML Field Guide, episode 6 (http://mng.bz/M07n) refers to this problem as the “online-offline gap.” The only way to be sure you’ve improved the system is to run the final stage of the workflow, the online measurement.
Simulations are tools that estimate a system’s business metrics offline. They might combine logged data, models of users or markets, scientific models, or heuristics. They can vary considerably in their form from domain to domain.
Simulations differ from the simple fitting metrics (cross-entropy or R²) discussed in the previous section. Simulations typically account for all components of a system and aim to produce numbers like revenue or user engagement that may be compared to the numbers that come from experimental measurements.
For example, a standard quant’s tool is a trading simulation. Offline, it runs historical market data—trades and quotes—through the same trading strategy code that is used in production. When that strategy asks to execute a trade, the simulator mimics the behavior of the market using heuristics or a model of the market. From this simulation, a quant can estimate profit, risk, shares traded, and other useful business metrics.
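A trading simulation can be sketched very simply, even though production-grade simulators are not simple at all. The toy backtest below replays a short, made-up price series through a naive momentum rule and tallies profit; real simulators also model order books, fees, latency, and market impact.

```python
# Toy backtest: the prices and the "predictor" (the last move predicts the next)
# are invented for illustration, and the strategy may well lose money on them.
prices = [100.0, 100.2, 100.1, 100.5, 100.4, 100.8, 100.6, 101.0]

position = 0     # shares held
cash = 0.0
for prev, price in zip(prices, prices[1:]):
    predicted_change = price - prev           # naive momentum prediction
    if predicted_change > 0 and position == 0:
        position, cash = 1, cash - price      # simulated BUY of one share
    elif predicted_change < 0 and position == 1:
        position, cash = 0, cash + price      # simulated SELL
profit = cash + position * prices[-1]         # mark any open position to market
print(f"simulated profit: {profit:.2f}")
```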
Simulations can give more precise answers—meaning numbers with smaller error bars—than experiments because they can use much more data. For example, a single simulation run might process anywhere from 1 month to 10 years of data, depending on the timescale over which the strategy trades. That run might take minutes to hours, depending on the complexity of the strategy. An experiment, on the other hand, that takes a measurement with 1 month of data needs to run for 1 month. Want 10 years of experimental data? You’ll wait 10 years.
Simulations may also be run multiple times on the same data set. Each run could try slight variations on the same strategy and allow the quant to choose the best one—the one with the best profit-to-risk tradeoff, for example—to trade in production. With experiments, multiple runs are impossible. You can’t trade for a month, say, then “rewind” real life and trade again with a different strategy. There are effective ways to compare different strategies experimentally, but the process is orders of magnitude faster in simulation.
Simulations may be more precise and faster, but experiments are more accurate. Simulations might be biased (inaccurate) because of missing counterfactuals, just like prediction models. What happens, for example, when a trading strategy sends an order to an exchange? It might show up in the market, and other traders will see and respond to it. This changes future market data, which is then seen by the trading strategy and used for its decisions, and so on. Other traders’ real responses to our actions simply don’t exist in simulation.
MLEs use simulation, too. Engineers working on Facebook Feed use a simulator that replays logged data through the Feed code and estimates users’ responses. In “Combining online and offline tests to improve News Feed ranking” (http://mng.bz/aPgB), they note that their offline simulations are biased. While the simulation results are related to real results, they don’t match exactly, and the relationship between them is nontrivial. (The blog post goes on to design a model-based mapping from simulation results to experimental results.)
Researchers who study a field called evolutionary robotics design robot controllers—pieces of code that take in sensor information and output commands to a robot’s actuators—using algorithms inspired by evolution. The evolutionary algorithms search for controller parameters that optimize the performance of the robot as measured by a simulation. The researchers notice so often that controllers designed in simulation don’t work on real robots that they have coined a term for this effect: the reality gap.
In a live-streamed event, Tesla Autonomy Day (https://youtu.be/Ucp0TTmvqOE, 2:02:00-2:06:00), CEO Elon Musk is asked why Tesla relies so much on data collected from real drivers instead of training their autonomous driving controller via simulation. He says that they do use simulation, but that since they “don’t know what they don’t know”—and all of what they don’t know would be missing from the simulation—they invest effort and money into collecting lots of real data. In the same video, AI director Andrej Karpathy gives several examples of rare, unanticipated images from around the world that need to be interpreted by their vision system. Without appealing to real-world data, their system would never learn to deal with these images.
Simulation is a powerful offline design tool. Simulations can be used in the second stage of the workflow to generate estimates of business metrics. Because they tend to be biased, and you can never know exactly how, it is always necessary to test changes to your system with an experiment.
- Experimental optimization is the process of improving an engineered system using measurement-based design decisions.
- Experimental methods minimize the time and risk associated with experimental measurements.
- Experiments are the most accurate way to measure the impact on business metrics of changes to an engineered system.
- Domain knowledge, prediction models, and simulation are powerful supplements to experiments but are not replacements for them.