Chapter 10. Example: digital display advertising


This chapter covers

  • Visualizing and preparing a real-world dataset
  • Building a predictive model of the probability that users will click a digital display advertisement
  • Comparing the performance of several algorithms in both training and prediction phases
  • Scaling by dimension reduction and parallel processing

Chapter 9 presented techniques that enable you to scale your machine-learning workflow. In this chapter, you’ll apply those techniques to a large-scale real-world problem: optimizing an online advertising campaign. We begin with a short introduction to the complex world of online advertising, the data that drives it, and some of the ways it’s used by advertisers to maximize return on advertising spend (ROAS). Then we show how to put some of the techniques in chapter 9 to use in this archetypal big-data application.

We employ several datasets in our example. Unfortunately, only a few large datasets of this type are available to the public. The primary dataset in our example isn’t available for download, and even if it were, it would be too large for personal computing.

One dataset that can be downloaded and used for noncommercial purposes is from the Kaggle Display Advertising Challenge sponsored by Criteo, a company whose business is optimizing the performance of advertising campaigns. The Criteo dataset contains more than 45 million observations of 39 features, of which 13 are numerical and 26 categorical. Unfortunately, as is common for datasets used in data science competitions, the meaning of the features is obfuscated. The variable names are V1 through V40. V1 is the label, and V2 through V40 are features. In the real world, you’d have the benefit of knowing what each feature measures or represents. But as the competition proved, you can nonetheless explore their predictive value and create useful models.

The Criteo dataset is available at https://s3-eu-west-1.amazonaws.com/criteo-labs/dac.tar.gz.
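If you’d like to follow along with the public data, the short sketch below loads a sample of it with pandas. It assumes the tab-separated train.txt layout distributed in the archive (one label column followed by 13 numerical and 26 categorical columns, no header row); the path and the number of rows to read are placeholders you’d adjust for your machine.

import pandas as pd

# Sketch only: load a sample of the Criteo competition data.
# Assumes train.txt (extracted from dac.tar.gz) is tab-separated with no
# header: 1 label column plus 13 numerical and 26 categorical columns.
cols = ['V{}'.format(i) for i in range(1, 41)]     # V1 = label, V2-V40 = features
criteo = pd.read_csv('train.txt', sep='\t', header=None,
                     names=cols, nrows=1000000)    # read a sample, not all 45M rows
print(criteo.shape)
print(criteo.V1.mean())                            # click rate in the sample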


10.1. Display advertising

Half the money I spend on advertising is wasted; the trouble is, I don’t know which half.

John Wanamaker

In the days of Mad Men, this was an inescapable truth. But with digital advertising comes the opportunity to discover what works and what doesn’t via the data collected as users interact with online ads.

Online advertising is delivered through a myriad of media. Display ads appear within web pages rendered in browsers, usually on personal computers or laptops. Because the rules for identifying users and the handling of internet cookies are different on mobile browsers, mobile ad technology relies on a different set of techniques and generates quite different historical data. Native ads, embedded in games and mobile apps, and pre-roll ads that precede online video content, are based on distinct delivery technologies and require analyses tailored to their unique processes. Our examples are limited to traditional display advertising.

Much of the terminology of display advertising was inherited from the print advertising business. The websites on which ads can be purchased are known as publications. Within a publication, advertising space is characterized by its size and format, known as the ad unit, and its location within the site and page is referred to as the placement. Each presentation of an ad is called an impression. Ads are sold in lots of 1,000 impressions, and the price per thousand impressions is known as the CPM (cost per thousand).

When a user browses to a web page—say, xyz.com—it appears that the publisher of xyz.com delivers the entire page. In reality, the page contains placeholders for advertisements that are filled in by various advertisers through a complex network of intermediaries. Each web server that delivers ads maintains logs that include information about each impression, including the publisher, the internet address of the user, and information contained in internet cookies, where information about previous deliveries from the advertiser’s server may be stored. In the next section, you’ll look at the sorts of data that’s captured during a display ad campaign.


10.2. Digital advertising data

Web servers capture data for each user request, including the following:

  • Client address— The IP address of the computer that made the request.
  • Request— The URL and parameters (for example, http://www.abc.com?x=1234&y=abc01).
  • Status— The response code issued by the server; usually 200, indicating successful response.
  • Referrer— The web page from which the user linked to the current page.
  • User agent— A text string that identifies the browser and operating system making the request.
  • Cookie— A small file stored when a browser visits a website. When the site is visited again, the file is sent along with the request.

In addition, many modern advertisements are served in conjunction with measurement programs—small JavaScript programs that capture information such as the following:

  • Viewability— Whether and for how long the advertisement was displayed.
  • User ID— Browser cookies are used to leave behind unique identifiers so that users can be recognized when encountered again.
  • Viewable seconds— The number of seconds the advertisement was in view.

Figure 10.1 shows sample data from a campaign. Viewability data is extracted from a query string, and user_id is a randomly generated identifier that associates users with previous visits.

Figure 10.1. Impression data. Domain names are randomly generated substitutes for the real names.

10.3. Feature engineering and modeling strategy

Click is our target variable. You want to predict the likelihood that impressions will result in clicks (sometimes called click-throughs or click-thrus). More specifically, given a specific user visiting a particular site, you’d like to know the probability that the user will click the advertisement. You have several choices in formulating the problem: you can try to predict the probability that a given user will click through, or you can try to predict the click-through rate (CTR) for each publisher that presents the ad.

As is often the case, precisely what you model and the precise values you endeavor to predict will ultimately be driven by asking these questions: How will the prediction be used? In what manner will it be acted on? In this case, our advertiser has the option of blacklisting certain publications, so the advertiser’s primary concern is identifying the publications least likely to yield clicks. In recent years, real-time bidding technologies have been developed that enable advertisers to bid for individual impressions based on user and publication features provided by the bidding system, but our example advertiser hasn’t adopted real-time bidding yet.

You might wonder at this point why the advertiser doesn’t just look at some historical data for all the publications and blacklist those with low CTRs. The problem is that when the overall CTR for a campaign is in the neighborhood of 0.1%, the expected value of clicks for a publication with only a few impressions is zero. The absence of clicks doesn’t indicate a low CTR. Further, when we aggregate the best-performing, low-volume publications, we often observe above-average CTR (so just blacklisting all the low-volume pubs isn’t a good strategy). You’re looking for a model that will enable you to predict publications’ performance without the benefit of a great deal of performance history.
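A quick back-of-the-envelope calculation shows why an absence of clicks tells you so little for small pubs. The sketch below uses the campaign-level CTR of roughly 0.1% and a hypothetical pub with 500 impressions; under a simple binomial assumption, such a pub shows zero clicks more often than not even when its true CTR matches the campaign average.

ctr = 0.001                               # assumed campaign-level CTR of 0.1%
impressions = 500                         # a hypothetical low-volume pub

expected_clicks = ctr * impressions       # 0.5 expected clicks
p_zero_clicks = (1 - ctr) ** impressions  # probability of seeing no clicks at all

print(expected_clicks)                    # 0.5
print(round(p_zero_clicks, 2))            # ~0.61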

At first glance, you might imagine you don’t have much to work with. You can count impressions, clicks, and views for users, publishers, and operating systems. Maybe time of day or day of the week has some effect. But on further reflection, you realize that the domains a user visits are features that describe the user, and the users who visit a domain are features of the domain. Suddenly, you have a wealth of data to work with and a real-world opportunity to experience the curse of dimensionality—a phrase used to describe the tribulations of working in high-dimensional space. As you explore the data, you’ll see that a wealth of features can be, if not a curse, a mixed blessing.

You may recognize the logic you’ll apply here as the basis of recommenders, the systems that suggest movies on Netflix, products on Amazon, and restaurants on Yelp. The idea of characterizing users as collections of items, and items as collections of users, is the basis of collaborative filtering, in which users are clustered based on common item preferences, and items are clustered based on the affinities of common users. Of course, the motivation for recommenders is to present users with items they’re likely to purchase. The advertising problem is a variation; instead of many items, the same advertisement is presented in a wide variety of contexts: the publications. The driving principle is that the greatest likelihood of achieving user responses (clicks) will be on publications that are similar to those that have a history of achieving responses. And because similarity is based on common users, pubs chosen in this manner will attract people who are similar in their preferences to past responders.


10.4. Size and shape of the data

You’ll start with a sample of 9 million observations, small enough to fit into memory so you can do some quick calculations of cardinality and distributions.

Listing 10.1. A first look at the data
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_pickle('combined.pickle')                       #1

nImps = len(df)
nPubs = len(df.pub_domain.unique())
nUsers = len(df.user_id.unique())
print('nImps={}\nnPubs={}\nnUsers={}'.format(nImps, nPubs, nUsers))
nImps=9098807                                                #2
nPubs=41576                                                  #3
nUsers=3696476                                               #4

(nPubs * nUsers) / 1000000                                   #5
153684                                                       #6

#1 - Loads data from a compressed archive
#2 - Number of impressions
#3 - Number of publisher domains
#4 - Number of distinct users
#5 - Size of the user/item matrix divided by 1 million for readability
#6 - 153.684 billion cells—a rather large matrix

Fortunately, most users never visit most of the domains, so the user/item matrix is sparsely populated, and you have tools at your disposal for dealing with large, sparse matrices. And nobody said that users and domains must be the rows and columns of a gigantic matrix, but it turns out that some valuable algorithms work exceptionally well when it’s possible to operate on a user/item matrix in memory.

Oh, and one more thing: the 9 million observations referenced in listing 10.1 represent roughly 0.1% of the data. Ultimately, you need to process roughly 10 billion impressions, and that’s just one week’s worth of data. We loaded the data from 9 million impressions into about 53% of the memory on an Amazon Web Services (AWS) instance with 32 GB of RAM, so this will certainly get more interesting as you go.

Next, let’s look at how the data is distributed over the categorical variables. In listing 10.1, we already started this process by computing the cardinality of pub_domain and user_id.

Listing 10.2. Distributions
import seaborn as sns                                        #1

nClicks = df.click.value_counts()[True]
print('nClicks={} ({}%)'
      .format(nClicks, round(float(nClicks) * 100 / nImps, 2)))
nClicks=10845 (0.12%)

nViews = df.viewed.value_counts()[True]
print('nViews={} ({}%)'.format(nViews,
      round(float(nViews) * 100 / nImps, 2)))
nViews=3649597 (40.11%)

df.groupby('pub_domain').size()                              #2
pub_domain
D10000000.com     321
D10000001.com     117
D10000002.com     124
D10000003.com      38
D10000004.com    8170
…

f = df.groupby('pub_domain').size()
f.describe()
count      41576.000000
mean         218.847580
std         6908.203538
min            1.000000
25%            2.000000
50%            5.000000
75%           19.000000
max      1060001.000000

sns.distplot(np.log10(f));

#1 - Seaborn is a statistical visualization library.
#2 - Group by domain and look at number of impressions per domain

Figure 10.2 shows that many domains have a small number of impressions, and a few have large numbers of impressions. So that you can see the distribution graphically, we plotted the base 10 log rather than the raw frequencies (we use base 10 so you can think of the x-axis as 10^0, 10^1, 10^2, and so on).

Figure 10.2. The histogram of impression data shows that the distribution of the number of impressions over publisher domains is heavily skewed.

Perhaps most significantly, you can see that clicks are relatively rare: only 0.12%, or 0.0012. This is a respectable overall click-through rate, but for this example, you need large datasets in order to have enough target examples to build your model. This isn’t unusual. We’re often trying to predict relatively rare phenomena. The capacity to process huge datasets by using big-data technologies has made it possible to apply machine learning to whole new classes of problems.

Similarly, impression frequency by user_id is highly skewed. An average user has 2.46 impressions, but the median is 1, so a few heavy hitters pull the mean higher.
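You can verify those figures with the same groupby pattern used for pub_domain in listing 10.2; this two-line sketch assumes the df loaded in listing 10.1.

u = df.groupby('user_id').size()   # impressions per user
u.describe()                       # mean around 2.46, 50% (median) = 1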


10.5. Singular value decomposition

Chapters 3 and 7 mentioned principal component analysis, or PCA, an unsupervised ML technique often used to reduce dimensions and extract features. If you look at each user as a feature of the publications they’ve interacted with, you have approximately 3.6 million features per publication, or about 150 billion values for your exploratory sample of data. Obviously, you’d like to work with fewer features, and fortunately you can do so fairly easily.

As it turns out, PCA has several algorithms, one of which is singular value decomposition, or SVD. You can explain and interpret SVD mathematically in various ways, and mathematicians will recognize that our explanation here leaves out some of the beauty of the underlying linear algebra. Fortunately, like the latent semantic analysis covered in chapter 7, SVD has an excellent implementation in the scikit-learn Python library. But this time, let’s do just a little bit of the matrix algebra. If you’ve done matrix multiplication, you know that dimensions are important. If A[n x p] denotes an n-by-p matrix, you can multiply A by another matrix whose dimensions are p by q (for example, B[p x q]), and the result will have dimensions of n by q (say, C[n x q]). It turns out that any matrix can be factored into three components, called the left and right singular vectors and the singular values, respectively.

In this example, n is the number of users, each of which is represented by a row in matrix A, and p is the number of pubs, each of which is represented by a column:

A[n x p] = U[n x p] S[p x p] VT[p x p]
What makes this interesting is that the singular values tell you something about the importance of the features represented by the left and right singular vectors (the vectors are the rows of U and VT). In particular, the singular values tell you the extent to which the corresponding feature vectors are independent. Consider the implication of interdependent or covariant features. Or to make it a bit easier, imagine that two features, A and B, are identical. After feature A has been considered by the model, feature B has nothing to contribute; it contains no new information. As builders of predictive models, we want features that are independent, each of which is at least a weak predictor of the target. If you have many weak predictors, so long as their predictions are better than random, in combination they gain strength. But this phenomenon, the ensemble effect, works only when the features are independent.
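To see how singular values expose redundant features, here’s a tiny self-contained numpy example (unrelated to the chapter’s dataset): a matrix whose third column duplicates its first produces a singular value of zero, telling you that one of the three column features contributes no new information.

import numpy as np

A = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 1.],
              [0., 0., 0.]])         # column 3 is identical to column 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 3))                # approximately [2.175, 1.126, 0.]
                                     # the zero says one column is redundant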

Let’s run SVD on our advertising data and have a look at the resulting singular values.

Listing 10.3. SVD on advertising data
import scipy.sparse as sp                      # added: the listing uses sp.lil_matrix
from scipy.sparse.linalg import svds           # added: sparse SVD

# users, pubs (arrays of unique keys) and df5 (the sampled impression
# DataFrame) are assumed to have been prepared earlier.
user_idx, pub_idx = {}, {}                     #1
for i in range(len(users)):
    user_idx[users[i]] = i
for i in range(len(pubs)):
    pub_idx[pubs[i]] = i

nTrainUsers = len(df.user_id.unique())         #2
nTrainPubs = len(df.pub_domain.unique())
V = sp.lil_matrix((nTrainUsers, nTrainPubs))

def matput(imp):
    if imp.viewed:
        V[user_idx[imp.user_id], pub_idx[imp.pub_domain]] = 1

df5[df5.click == True].apply(matput, axis=1)

# run svds (svd for sparse matrices)
u, s, vt = svds(V, k=1550)
plt.plot(s[::-1])

#1 - First substitutes integer indices for user and pub symbolic keys
#2 - Creates a sparse matrix of user/pub interactions

When you ran SVD, you used the parameter k (the maximum number of singular values) to limit the calculation to the 1,550 largest singular values. Figure 10.3 shows their magnitude; you can see that there are about 1,425 nonzero values, and that beyond the 450 most independent feature vectors, the rest are highly covariant. This isn’t surprising. Although there are over 3 million users, remember that most of them interact with very few pubs. Consider that 136,000 of these users were observed exactly once (on ebay.com, by the way). So if each user vector is a feature of the pub, ebay.com has 136,000 features that are identical.

Figure 10.3. Singular values for advertising data

Our SVD reduced more than 3 million features to around 7 thousand, a 400:1 reduction. Knowing this, you have a much better sense of the resources that will be needed. In the next section, you’ll look at ways to size and optimize the resources necessary to train your models.
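One way to put the decomposition to work, sketched below, is to keep only the strongest components and treat the scaled right singular vectors as a compact feature matrix for the pubs. The sketch assumes the u, s, and vt arrays from listing 10.3 and uses the 450-component cutoff suggested by figure 10.3.

import numpy as np

k_keep = 450                               # components to retain, per figure 10.3
order = np.argsort(s)[::-1]                # svds doesn't guarantee descending order
top = order[:k_keep]

# Each pub becomes a k_keep-dimensional vector: the rows of vt scaled by
# their singular values, transposed so pubs are rows and components columns.
pub_features = vt[top, :].T * s[top]
print(pub_features.shape)                  # (nTrainPubs, 450)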


10.6. Resource estimation and optimization

So far, you’ve looked at the cardinalities and distributions that characterize your data and done some feature engineering. In this section, you’ll assess the task at hand in terms of the computational workload relative to the resources you have at your disposal.

To estimate resource requirements, you need to start with some measurements. First let’s look at your available resources. So far, you’ve been using a single m4.2xlarge Amazon EC2 instance. Let’s decode that quickly. EC2 is Amazon’s Elastic Compute Cloud. Each instance is a virtual server with dedicated CPU, random access memory (RAM), and disk or solid-state online storage. The m4.2xlarge designation means a server with eight cores and 32 GB of memory. Disk space is provisioned separately. Our single instance has 1 terabyte of elastic block storage (EBS). EBS is virtualized storage, set up so that it appears that your instance has a dedicated 1 TB disk volume. You’ve set up your instance to run Linux. Depending on your needs, you can easily upgrade your single instance to add cores or memory, or you can provision more instances.

Next, let’s have a look at your workload. Your raw data resides in transaction files on Amazon’s Simple Storage Service, S3, which is designed to store large quantities of data inexpensively, but access is much slower than a local disk file. Each file contains around 1 million records. You can read approximately 30,000 records per second from S3, so if you process them one at a time, 10 billion will take about 92 hours. Downloading from S3 can be sped up by around 75% by processing multiple downloads in parallel (on a single instance), which gets you down to about 23 hours.
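The speedup comes from overlapping the network waits. Here’s a minimal sketch of the idea using boto3 and a thread pool; the bucket name, key layout, and worker count are hypothetical, and a real pipeline would parse each file rather than just count its bytes.

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')
BUCKET = 'my-ad-logs'                                   # hypothetical bucket name
KEYS = ['impressions/part-{:05d}.gz'.format(i)          # hypothetical key layout
        for i in range(100)]

def fetch(key):
    # Download one transaction file; a real worker would parse it here.
    body = s3.get_object(Bucket=BUCKET, Key=key)['Body'].read()
    return len(body)

with ThreadPoolExecutor(max_workers=16) as pool:        # overlap S3 latency
    sizes = list(pool.map(fetch, KEYS))

print(sum(sizes), 'bytes downloaded')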

But speed isn’t your only problem. Based on your earlier observation that 10 million records loaded into memory consume 53% of your 32 GB of memory, it would take 1.7 terabytes of memory to load your entire dataset. Even if you could afford it, Amazon doesn’t have an instance with that much RAM.

Fortunately, you don’t need all the data in memory. Furthermore, your requirement isn’t just a function of the size of the data, but of its shape—by which we mean the cardinality of its primary keys. It turns out that there are 10 billion records, but only about 10 million users and around 300 thousand pubs, which means the user/pub matrix has around 3 trillion entries. But when you populated your sparse matrix, there were values in only about 0.01% of the cells, so 3 trillion is reduced to 300 million. Assuming one 64-bit floating-point number per value, your user/pub matrix will fit in about 2.5 GB of your 32 GB.
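The arithmetic behind that estimate is easy to sanity-check in a few lines, using the rough counts quoted above:

n_users = 10e6                # about 10 million users
n_pubs = 300e3                # about 300 thousand pubs
density = 0.0001              # roughly 0.01% of cells are populated
bytes_per_value = 8           # one 64-bit float per nonzero cell

cells = n_users * n_pubs                      # ~3 trillion cells
nonzero = cells * density                     # ~300 million nonzero values
gigabytes = nonzero * bytes_per_value / 1e9   # ~2.4 GB
print(cells, nonzero, gigabytes)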

To cut processing time, you need to look at doing things in parallel. Figure 10.4 illustrates using worker nodes (additional EC2 instances, in this case) to ingest the raw data in parallel.

Figure 10.4. Parallel processing scales the initial data acquisition.

The worker nodes do more than read the data from S3. Each one independently builds a sparse matrix of users and items. When all the workers are finished with their jobs, these partial matrices are combined by your compute node.
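A simplified sketch of that pattern is shown below, using Python’s multiprocessing on one machine in place of separate EC2 workers. It assumes a hypothetical load_shard helper that yields (user_id, pub_domain) pairs for one raw-data file, plus the user_idx and pub_idx mappings from listing 10.3; each worker builds its own sparse matrix, and the partial results are combined by summation.

from multiprocessing import Pool
import scipy.sparse as sp

def build_partial_matrix(shard_path):
    # Each worker builds a sparse user/pub matrix from one shard of raw logs.
    M = sp.lil_matrix((len(user_idx), len(pub_idx)))
    for user_id, pub_domain in load_shard(shard_path):      # assumed helper
        M[user_idx[user_id], pub_idx[pub_domain]] = 1
    return M.tocsr()

shards = ['shard-{:04d}.csv'.format(i) for i in range(32)]  # hypothetical shard names

with Pool(processes=8) as pool:
    partials = pool.map(build_partial_matrix, shards)       # the "map" step

# The "reduce" step: combine the partial matrices on the compute node.
# Entries can accumulate across shards; threshold back to 1s if you need
# a strictly binary matrix.
V = sum(partials[1:], partials[0])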

Chapter 9 described some big-data technologies: Hadoop, MapReduce, and Apache Spark. The processes described here are a highly simplified version of what happens in a MapReduce job. A large task is broken into small units, each of which is dispatched (mapped) to a worker. As workers complete their subtasks, the results are combined (reduced), and that result is returned to the requestor. Hadoop optimizes this process in several ways. First, rather than having the workers retrieve data over a network, each worker node stores part of the data locally. Hadoop optimizes the assignment of tasks so that whenever possible, each node works on data that’s already on a local volume. Spark goes one step further by having the worker nodes load the data into memory so they don’t need to do any I/O operations in order to process the tasks they’re assigned.

Although this example problem is large enough to require a little parallel processing, it’s probably not worth the effort required to implement one of these frameworks. You need to run your entire workflow only once per day, and you could easily add a few more instances and get the whole process down to an hour or less. But you can easily imagine an application requiring you to run a variety of processes at a greater frequency, where having the worker nodes retain the raw data in memory over the course of many processing cycles would boost performance by orders of magnitude.


10.7. Modeling

Your goal for the model is to predict CTR for each pub. You started with user interactions as features and used SVD to reduce the feature space. From here, there are several approaches to making predictions. Your first model will be a k-nearest neighbors (KNN) model. This is a simple but surprisingly effective recommender model.

You’ll also train a random forest regressor. Random forests are a form of decision-tree-based learning; many random samples of data and random subsets of the feature set are selected, and decision trees are constructed for each selection.

10.8. K-nearest neighbors

Figure 10.5 shows simplified user/item and dissimilarity matrices. Notice that the diagonal of the dissimilarity matrix is all zeros because each pub’s user vector (column in the user/item matrix) is identical to itself, and therefore zero distance from itself. You can see that the distance between pub3, pub4, and pub7 is zero, as you’d expect, because their respective columns in the user/item matrix are identical. Also note that pub1’s distance to pub5 is the same as pub5’s distance to pub1. In other words, dissimilarity is symmetric. Interestingly, some recommender algorithms don’t define distance symmetrically. Item A may be like item B, but item B isn’t like item A.

Figure 10.5. The dissimilarity, or distance, matrix shows the extent to which user interactions are similar or different. In this example, the user/item matrix is binary, indicating whether the user has interacted with the pub.

You compute the similarity (actually, dissimilarity, or distance) between each pair of pubs, using one of several available measures. Here you choose the most common measure, the Euclidean distance.
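Computing a dissimilarity matrix like the one in figure 10.5 takes only a line or two with scikit-learn. This toy sketch uses a small made-up binary user/item array with users as rows and pubs as columns; identical columns come out with a distance of zero, just as in the figure.

import numpy as np
from sklearn.metrics import pairwise_distances

# Toy binary user/item matrix: rows are users, columns are pubs.
user_item = np.array([[1, 0, 1, 1, 0],
                      [0, 1, 1, 1, 1],
                      [1, 1, 0, 0, 1]])

# Euclidean distance between every pair of pub columns (pubs become rows via .T).
D = pairwise_distances(user_item.T, metric='euclidean')
print(np.round(D, 2))     # zero diagonal, symmetric; identical pubs are 0 apart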

After you’ve computed pairwise distances, the next step is to compute your predicted CTR for each pub. In KNN, the predicted target value is calculated by averaging the target values of the k nearest neighbors, presuming that each example observation will be most similar to its nearest neighbors. There are several important questions at this juncture. First, what should you choose for the value of k? How many neighbors should be considered? Also, it’s common to give greater weight to the closest neighbors, usually by weighting the calculation of the mean target value by 1/distance or (1/distance)^2.
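As a toy illustration of the weighting (the numbers are made up): with three neighbors at distances 0.5, 1.0, and 2.0 whose observed CTRs are 0.2%, 0.1%, and 0.4%, the 1/distance weights pull the prediction toward the closest neighbor.

import numpy as np

d = np.array([0.5, 1.0, 2.0])            # distances to the k nearest neighbors
y = np.array([0.002, 0.001, 0.004])      # their observed CTRs

pred_equal = np.average(y)                        # unweighted mean: ~0.00233
pred_inv = np.average(y, weights=1 / d)           # 1/distance weighting: 0.002
pred_inv_sq = np.average(y, weights=1 / d ** 2)   # (1/distance)^2 weighting: ~0.0019
print(pred_equal, pred_inv, pred_inv_sq)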

Listing 10.4 shows a calculation of predicted values for a range of possible values of k by using scikit-learn NearestNeighbors. Here you try three weighting formulas, each over 20 values of k. Figure 10.6 shows that the best predictors are one or two nearest neighbors, and averaging over a larger range offers no real improvement. This is probably because our data is sparse, and nearest neighbors are often fairly distant. Note that the variation over the values of k is also small. In any case, the normalized RMSE for our test set predictions is in the range of 5%. Not bad!

Figure 10.6. RMSE for three weighting functions and values of k = 1 to k = 30
Listing 10.4. KNN predictions
from sklearn.neighbors import NearestNeighbors

# pubsums and tsums (per-pub aggregates with a CTR column for the training
# and test sets) are assumed to have been prepared earlier.
weightFunctions = {
    'f1': lambda x: [1 for i in range(len(x))],                  #1
    'f2': lambda x: 1 / x,                                       #2
    'f3': lambda x: 1 / x ** 2                                   #3
}

for idx, f in enumerate(weightFunctions):                        #4
    rmseL = []
    wf = weightFunctions[f]
    for nNeighbors in range(1, 20, 1):
        neigh = NearestNeighbors(n_neighbors=nNeighbors)         #5
        neigh.fit(VT)                                            #6
        act = pd.Series()
        pred = pd.Series()
        for i in range(TT.shape[0]):                             #7
            d = neigh.kneighbors(TT[i, :], return_distance=True)
            W = pd.Series([v for v in d[0][0]])
            y = pd.Series(pubsums.iloc[d[1][0]].CTR)
            act = act.append(pd.Series(tsums.iloc[i].CTR))
            pred = pred.append(pd.Series(np.average(y, weights=wf(W))))
        rmse = np.sqrt(act.sub(pred).pow(2).mean()) / (pred.max() - pred.min())
        rmseL.append(rmse)
    plt.subplot(130 + idx + 1)
    plt.plot(range(1, 20, 1), rmseL)
plt.tight_layout(pad=2.0)

#1 - Equal weights
#2 - 1/dist
#3 - 1/dist squared
#4 - For each of the three weighting schemes, computes predicted target values for k = 1 through 20
#5 - Initializes
#6 - Finds k-nearest neighbors; VT is user/item transposed of the training set
#7 - TT is user/item transposed of the test set

10.9. Random forests

In the training phase of random forests, data is sampled repeatedly, with replacement, in a process called bagging, sometimes called bootstrap aggregating. For each sample, a decision tree is constructed using a randomly selected subset of the features. To make predictions on unseen data, each decision tree is evaluated independently, and the results are averaged (for regression) or the trees vote (for classification). For many applications, random forests may be outperformed by other algorithms such as boosted trees or support vector machines, but random forests have the advantages that they’re easy to apply, their results are easy to interpret and understand, and the training of many trees is easily parallelized. Once again, you’ll use scikit-learn; see figure 10.7.

Figure 10.7. Variable importance for the random forest regression
Listing 10.5. Random forest regression
from sklearn.ensemble import RandomForestRegressor
from sklearn import cross_validation

features = ['exposure', 'meanViewTime', 'nImps', 'reach', 'reachRate',    #1
            'vImps', 'vRate', 'vReach', 'vReachRate']

X_train, X_test, y_train, y_test = cross_validation.train_test_split(    #2
    df[features], df.CTR, test_size=0.40, random_state=0)

reg = RandomForestRegressor(n_estimators=100, n_jobs=-1)                 #3
model = reg.fit(X_train, y_train)

scores = cross_validation.cross_val_score(model, X_train, y_train)       #4
print(scores, scores.mean())
(array([ 0.62681533,  0.66944703,  0.63701492]), 0.64442575999999996)

model.score(X_test, y_test)                                              #5
0.6135074515145226

plt.rcParams["figure.figsize"] = [12.0, 4.0]
plt.bar(range(len(features)), model.feature_importances_, align='center')
_ = plt.xticks(range(len(features)), features)

#1 - Features are simple aggregates by pub
#2 - Splits data into test and train, features and targets; trains on 60% of the data, holds out 40% for test
#3 - Runs the random forest regression with 100 trees; n_jobs parameter tells RF to use all available cores
#4 - Cross-validation splits training set to evaluate the model
#5 - Runs the model on the test set

The optimized random forest regression provides a useful prediction of CTR, but it’s not as good as the KNN prediction. Your next steps might be to explore ways to combine these, and possibly other, models. Methods that combine models in this way are called ensemble methods. Random forests are, in their own right, an ensemble method, as bagging is a way of generating multiple models. To combine entirely different models such as the two in this example, you might employ stacking, or stacked generalization, in which the predictions from multiple models become features that are combined by training and prediction using yet another ML model, usually logistic regression.
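A minimal stacking sketch follows. It isn’t the chapter’s pipeline: it assumes a single shared feature matrix (X_train, X_test) and CTR targets, and because the target is a rate it uses linear regression as the combiner rather than the logistic regression typically used for classification. Out-of-fold predictions from the base models become the meta-model’s inputs, so the combiner is trained on predictions for data the base models didn’t fit on.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# X_train, X_test, y_train: assumed pub-level features and CTR targets.
base_models = [KNeighborsRegressor(n_neighbors=2),
               RandomForestRegressor(n_estimators=100)]

# Out-of-fold predictions from each base model become meta-features.
meta_features = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in base_models])
stacker = LinearRegression().fit(meta_features, y_train)

# To predict: refit the base models on all training data, stack their
# test-set predictions, and let the meta-model combine them.
fitted = [m.fit(X_train, y_train) for m in base_models]
meta_test = np.column_stack([m.predict(X_test) for m in fitted])
ctr_pred = stacker.predict(meta_test)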


10.10. Other real-world considerations

You looked at the real-world issues that come with big data: high dimensionality, computing resources, storage, and network data transfer constraints. As we mentioned briefly, the entire process may be replicated for several species of digital ads: mobile, video, and native. Real-time bidding and user-level personalization have an entirely different set of concerns. The data at your disposal may vary widely from one program to the next, and the models that work perfectly in one situation may fail entirely for another.

In our example, we had a large historical dataset to start with. But our recommender-like approach has an issue known as the cold-start problem. When a new user or a new product enters the system with no history to rely on, you have no basis for building associations. For our purposes, a few unknowns don’t matter, but when a new campaign starts from scratch, you have no history at all to work with. Models built on the basis of other similar campaigns may or may not be effective.

In the real world, there’s a great advantage to having a variety of tools and models that can be employed. The larger and more complex the environment, the greater the benefit of having such a suite of feature-building, data-reduction, training, prediction, and assessment tools well organized and built into a coherent automated workflow.

Advertising is a great example of a business in which externalities may diminish the effectiveness of your predictive models. As technology and business practices change, behaviors change. The growth of mobile devices has changed the digital landscape dramatically. Real-time bidding completely changes the level on which you apply optimization. New forms of fraud, ad blockers, new browsers, and new web technology all change the dynamics that you’re modeling. In the real world, models are built, tested, deployed, measured, rebuilt, retested, redeployed, and measured again.

Digital advertising is a multibillion-dollar business, and for the brands that rely on it, optimizations that reduce wasted expenditures, even a little, can have a significant return on investment. Each wasted impression you can eliminate saves money, but when replaced with one that results in gaining a customer, the benefit will be far greater than the cost savings—and will more than justify the effort to overcome the many challenges of this dynamic business.


10.11. Summary

This chapter covered elements of a real-world machine-learning problem somewhat more broadly than just choosing algorithms, training, and testing models. Although these are the heart of the discipline of machine learning, their success often depends on surrounding practicalities and trade-offs. Here are some of the key points from this chapter’s example:

  • The first step is always to understand the business or activity you’re modeling, its objectives, and how they’re measured. It’s also important to consider how your predictions can be acted on—to anticipate what adjustments or optimizations can be made based on the insight you deliver.
  • Different feature-engineering strategies may yield very different working datasets. Casting a wide net and considering a range of possibilities can be beneficial. In the first model, you expanded the feature set vastly and then reduced it using SVD. In the second, you used simple aggregations. Which approach works best depends on the problem and the data.
  • After exploring a subsample of data, you were able to estimate the computing resources needed to perform your analyses. In our example, the bottleneck wasn’t the ML algorithms themselves, but rather the collection and aggregation of raw data into a form suitable for modeling. This isn’t unusual, and it’s important to consider both prerequisite and downstream workflow tasks when you consider resource needs.
  • Often, the best model isn’t a single model, but an ensemble of models, the predictions of which are aggregated by yet another predictive model. In many real-world problems, practical trade-offs exist between the best possible ensembles and the practicality of creating, operating, and maintaining complex workflows.
  • In the real world, there are often a few, and sometimes many, variations on the problem at hand. We discussed some of these for advertising, and they’re common in any complex discipline.
  • The underlying dynamics of the phenomena you model often aren’t constant. Business, markets, behaviors, and conditions change. When you use ML models in the real world, you must constantly monitor their performance and sometimes go back to the drawing board.

10.12. Terms from this chapter

  • recommender— A class of ML algorithms used to predict users’ affinities for various items.
  • collaborative filtering— Recommender algorithms that work by characterizing users via their item preferences, and items by the preferences of common users.
  • ensemble method— An ML strategy in which multiple models’ independent predictions are combined.
  • ensemble effect— The tendency of multiple combined models to yield better predictive performance than the individual components.
  • k-nearest neighbors— An algorithm that bases predictions on the nearest observations in the training space.
  • Euclidean distance— One of many ways of measuring distances in feature space. In two-dimensional space, it’s the familiar distance formula.
  • random forest— An ensemble learning method that fits multiple decision tree classifiers or regressors to subsets of the training data and features and makes predictions based on the combined model.
  • bagging— The process of repeated sampling with replacement used by random forests and other algorithms.
  • stacking— Use of a machine-learning algorithm, often logistic regression, to combine the predictions of other algorithms to create a final “consensus” prediction.

10.13. Recap and conclusion

The first goal in writing this book was to explain machine learning as it’s practiced in the real world, in an understandable and interesting way. Another was to enable you to recognize when machine learning can solve your real-world problems. Here are some of the key points:

  • Machine-learning methods are truly superior for certain data-driven problems.
  • A basic machine-learning workflow includes data preparation, model building, model evaluation, optimization, and prediction.
  • Data preparation includes ensuring that a sufficient quantity of the right data has been collected, visualizing the data, exploring the data, dealing with missing data, recoding categorical features, performing feature engineering, and always watching out for bias.
  • Machine learning uses many models. Broad classes are linear and nonlinear, parametric and nonparametric, supervised and unsupervised, and classification and regression.
  • Model evaluation and optimization involves iterative cross-validation, performance measurement, and parameter tuning.
  • Feature engineering enables application of domain knowledge and use of unstructured data. It can often improve the performance of models dramatically.
  • Scale isn’t just about big data. It involves the partitioning of work, the rate at which new data is ingested, training time, and prediction time, all in the context of business or mission requirements.

The mathematics and computer science of machine learning have been with us for 50 years, but until recently they were confined to academia and a few esoteric applications. The growth of giant internet companies and the propagation of data as the world has gone online have opened the floodgates. Businesses, governments, and researchers are discovering and developing new applications for machine learning every day. This book is primarily about these applications, with just enough of the foundational mathematics and computer science to explain not just what practitioners do, but how they do it. We’ve emphasized the essential techniques and processes that apply regardless of the algorithms, scale, or application. We hope we’ve helped to demystify machine learning and in so doing helped to advance its use to solve important problems.

Progress comes in waves. The computer automation wave changed our institutions. The internet tidal wave changed our lives and our culture. There are good reasons to expect that today’s machine learning is but a preview of the next wave. Will it be a predictable rising tide, a rogue wave, or a tsunami? It’s too soon to say, but adoption isn’t just proceeding; it’s accelerating. At the same time, advances in machine-learning tools are impressive, to say the least. Computer systems are advancing in entirely new ways as we program them to learn progressively more-abstract skills. They’re learning to see, hear, speak, translate languages, drive our cars, and anticipate our needs and desires for goods, services, knowledge, and relationships.

Arthur C. Clarke said that any sufficiently advanced technology is indistinguishable from magic (Clarke’s third law). When machine learning was first proposed, it did sound like magic. But as it has become more commonplace, we’ve begun to understand it as a tool. As we see many examples of its application, we can generalize (in the human sense) and imagine other uses without knowing all the details of its internal workings. Like other advanced technologies that were once seen as magic, machine learning is coming into focus as a natural phenomenon, in the end more subtle and beautiful than magic.

Further reading

For those of you who’d like to learn more about using ML tools in the Python language, we recommend Machine Learning in Action by Peter Harrington (Manning, 2012).

For a deep dive with examples in the R language, consider Applied Predictive Modeling by Max Kuhn and Kjell Johnson (Springer, 2013).

Cathy O’Neil describes her and Rachel Schutt’s book, Doing Data Science: Straight Talk from the Frontline (O’Reilly Media, 2013) as “a course I wish had existed when I was in college.” We agree.

If you’re interested in the implications of big data and machine learning for businesses and society, consider Big Data, A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier (Houghton Mifflin Harcourt, 2013).

Online resources include the following:
