Chapter 7. Exploring your data with aggregations


This chapter covers

  • Metrics aggregations
  • Single and multi-bucket aggregations
  • Nesting aggregations
  • Relations among queries, filters, and aggregations

So far in this book, we’ve concentrated on the use case of indexing and searching: you have many documents and the user wants to find the most relevant matches to some keywords. There are more and more use cases where users aren’t interested in specific results. Instead, they want to get statistics from a set of documents. These statistics might be hot topics for news, revenue trends for different products, the number of unique visitors to your website, and much more.

Aggregations in Elasticsearch solve this problem by loading the documents matching your search and doing all sorts of computations, such as counting the terms of a string field or calculating the average on a numeric field. To look at how aggregations work, we’ll use an example from the get-together site you’ve worked with in previous chapters: a user entering your site may not know what groups to look for. To give the user something to start with, you could make the UI show the most popular tags for existing groups of your get-together site, as illustrated in figure 7.1.

Figure 7.1. Example use case of aggregations: top tags for get-together groups

Those tags would be stored in a separate field of your group documents. The user could then select a tag and filter down to only documents containing that tag. This makes it easier for users to find groups relevant to their interests.

To get such a list of popular tags in Elasticsearch, you’d use aggregations, and in this specific case, you’d use the terms aggregation on the tags field, which counts occurrences of each term in that field and returns the most frequent terms. Many other types of aggregations are also available, and we’ll discuss them later in this chapter. For example, you can use a date_histogram aggregation to show how many events happened in each month of the last year, use the avg aggregation to show you the average number of attendees for each event, or even find out which users have similar taste for events as you do by using the significant_terms aggregation.


What about facets?

If you’ve used Lucene, Solr, or even Elasticsearch for some time, you might have heard about facets. Facets are similar to aggregations, because they also load the documents matching your query and perform computations in order to return statistics. Facets are still supported in versions 1.x but are deprecated and will be removed in version 2.0.

The main difference between aggregations and facets is that you can’t nest multiple types of facets in Elasticsearch, which limits the possibilities for exploring your data. For example, if you had a blogging site, you could use the terms facet to find out the hot topics this year, or you could use the date histogram facet to find out how many articles are posted each day, but you couldn’t find the number of posts per day, separately for each topic (at least not in one request). You’d be able to do that if you could nest the date histogram facet under the terms facet.

Aggregations were born to remove this limit and allow you to get deeper insights from your documents. For example, if you store your online shop logs in Elasticsearch, you can use aggregations to find not only the best-selling products but also the best-selling products in each country, the trends for each product in each country, and so on.


In this chapter, we’ll first discuss the common traits of all aggregations: how you run them and how they relate to the queries and filters you learned in previous chapters. Then we’ll dive into the particularities of each type of aggregation, and in the end, we’ll show you how to combine different aggregation types.

Aggregations are divided into two main categories: metrics and bucket. Metrics aggregations refer to the statistical analysis of a group of documents, resulting in metrics such as the minimum value, maximum value, standard deviation, and much more. For example, you can get the average price of items from an online shop or the number of unique users logging on to it.

Bucket aggregations divide matching documents into one or more containers (buckets) and then give you the number of documents in each bucket. The terms aggregation, which would give you the most popular tags in figure 7.1, makes a bucket of documents for each tag and gives you the document count for each bucket.

Within a bucket aggregation, you can nest other aggregations, making the sub-aggregation run on each bucket of documents generated by the top-level aggregation. You can see an example in figure 7.2.

Figure 7.2. The terms bucket aggregation allows you to nest other aggregations within it.

Looking at the figure from the top down, you can see that if you’re using the terms aggregation to get the most popular group tags, you can also get the average number of members for groups matching each tag. You could also ask Elasticsearch to give you, per tag, the number of groups created per year.

As you may imagine, you can combine many types of aggregations in many ways. To get a better view of the available options, we’ll go through metrics and bucket aggregations and then discuss how you can combine them. But first, let’s see what’s common for all types of aggregations: how to write them and how they relate to your queries.


7.1. Understanding the anatomy of an aggregation

All aggregations, no matter their type, follow some rules:

  • You define them in the same JSON request as your queries, and you mark them by the key aggregations, or aggs. You need to give each one a name and specify the type and the options specific to that type.
  • They run on the results of your query. Documents that don’t match your query aren’t accounted for unless you include them with the global aggregation, which is a bucket aggregation that will be covered later in this chapter.
  • You can further filter down the results of your query, without influencing aggregations. To do that, we’ll show you how to use post filters. For example, when searching for a keyword in an online shop, you can build statistics on all items matching the keyword but use post filters to show only results that are in stock.

Let’s take a look at the popular terms aggregation, which you’ve already seen in the intro to this chapter. The example use case was getting the most popular subjects (tags) for existing groups of your get-together site. We’ll use this same terms aggregation to explore the rules that all aggregations must follow.

7.1.1. Structure of an aggregation request

In listing 7.1, you’ll run a terms aggregation that will give you the most frequent tags in the get-together groups. The structure of this terms aggregation will apply to every other aggregation.


Note

For this chapter’s listings to work, you’ll need to index the sample dataset from the code samples that come with the book, located at https://github.com/dakrone/elasticsearch-in-action.


Listing 7.1. Using the terms aggregation to get top tags
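Here’s a minimal sketch of what that request might look like, assuming the not_analyzed tags.verbatim field that also appears in listing 7.9; the aggregation name top_tags is arbitrary:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
"aggregations": {
  "top_tags": {
    "terms": {
      "field": "tags.verbatim"
    }
  }
}}'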
  • At the top level there’s the aggregations key, which can be shortened to aggs.
  • On the next level, you have to give the aggregation a name. You can see that name in the reply. This is useful when you use multiple aggregations in the same request, so you can easily see the meaning of each set of results.
  • Finally, you have to specify the aggregation type terms and the specific option. In this case, you’ll have the field name.

The aggregation request from listing 7.1 hits the _search endpoint, just like the queries you’ve seen in previous chapters. In fact, you also get back 10 group results: because no query was specified, Elasticsearch effectively runs the match_all query you saw in chapter 4, so the aggregation runs on all the group documents. Running a different query makes the aggregation run through a different set of documents. Either way, you get 10 hits because size defaults to 10; as you saw in chapters 2 and 4, you can change size from either the URI or the JSON payload of your query.


Field data and aggregations

When you run a regular search, it goes fast because of the nature of the inverted index: you have a limited number of terms to look for, and Elasticsearch will identify documents containing those terms and return the results. An aggregation, on the other hand, has to work with the terms of each document matching the query. It needs a mapping between document IDs and terms—the opposite of the inverted index, which maps terms to documents.

By default, Elasticsearch un-inverts the inverted index into field data, as we explained in chapter 6, section 6.10. The more terms it has to deal with, the more memory the field data will use. That’s why you have to make sure you give Elasticsearch a large enough heap, especially when you’re doing aggregations on large numbers of documents or if you’re analyzing fields and you have more than one term per document. For not_analyzed fields, you can use doc values to have this un-inverted data structure built at index time and stored on disk. More details about field data and doc values can be found in chapter 6, section 6.10.


7.1.2. Aggregations run on query results

Computing metrics over the whole dataset is just one of the possible use cases for aggregations. Often you want to compute metrics in the context of a query. For example, if you’re searching for groups in Denver, you probably want to see the most popular tags for those groups only. As you’ll see in the next listing, this is the default behavior for aggregations. Unlike in listing 7.1, where the implied query was match_all, in the following listing you query for “Denver” in the location field, and aggregations will only be about groups from Denver.

Listing 7.2. Getting top tags for groups in Denver
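A sketch of this request, assuming the same location and tags.verbatim fields used elsewhere in this chapter:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
"query": {
  "match": {
    "location": "denver"
  }
},
"aggregations": {
  "top_tags": {
    "terms": {
      "field": "tags.verbatim"
    }
  }
}}'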

Recall from chapter 4 that you can use the from and size parameters of your query to control the pagination of results. These parameters have no influence on aggregations because aggregations always run on all the documents matching a query.

If you want to restrict query results more without also restricting aggregations, you can use post filters. We’ll discuss post filters and the relationship between filters and aggregations in general in the next section.

7.1.3. Filters and aggregations

In chapter 4 you saw that for most query types there’s a filter equivalent. Because filters don’t calculate scores and are cacheable, they’re faster than their query counterparts. You’ve also learned that you should wrap filters in a filtered query, like this:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
"query": {
  "filtered": {
    "filter": {
      "term": {
        "location": "denver"
      }
    }
  }
}}'

Using the filter this way is good for overall query performance because the filter runs first. Then the query—which is typically more performance-intensive—runs only on documents matching the filter. As far as aggregations are concerned, they run only on documents matching the overall filtered query, as shown in figure 7.3.

Figure 7.3. A filter wrapped in a filtered query runs first and restricts both results and aggregations.

“Nothing new so far,” you might say. “The filtered query behaves like any other query when it comes to aggregations,” and you’d be right. But there’s also another way of running filters: by using a post filter, which will run after the query and independent of the aggregation. The following request will give the same results as the previous filtered query:

% curl 'localhost:9200/get-together/group/_search?pretty' -d '{
"post_filter": {
  "term": {
    "location": "denver"
  }
}}'

As illustrated in figure 7.4, the post filter differs from the filter in the filtered query in two ways:

Figure 7.4. Post filter runs after the query and doesn’t affect aggregations.
  • Performance— The post filter runs after the query, so the query runs on all documents and the filter runs only on those documents matching the query. The overall request is typically slower than the filtered query equivalent, where the filter runs first.
  • Document set processed by aggregations— If a document doesn’t match the post filter, it will still be accounted for by aggregations.

Now that you understand the relationships between queries, filters, and aggregations, as well as the overall structure of an aggregation request, we can dive deeper into Aggregations Land and explore different aggregation types. We’ll start with metrics aggregations and then go to bucket aggregations, and then we’ll discuss how to combine them to get powerful insights from your data in real time.


7.2. Metrics aggregations

Metrics aggregations extract statistics from groups of documents, or, as we’ll explore in section 7.4, buckets of documents coming from other aggregations.

These statistics are typically done on numeric fields, such as the minimum or average price. You can get each such statistic separately or you can get them together via the stats aggregation. More advanced statistics, such as the sum of squares or the standard deviation, are available through the extended_stats aggregation.

For both numeric and non-numeric fields you can get the number of unique values using the cardinality aggregation, which will be discussed in section 7.2.3.

7.2.1. Statistics

We’ll begin looking at metrics aggregations by getting some statistics on the number of attendees for each event.

From the code samples, you can see that event documents contain an array of attendees. You can calculate the number of attendees at query time through a script, which we’ll show in listing 7.3. We discussed scripting in chapter 3, when you used scripts for updating documents. In general, with Elasticsearch queries you can build a script field, where you put a typically small piece of code that returns a value for each document. In this case, the value will be the count of elements of the attendees array.


The flexibility of scripts comes with a price

Scripts are flexible when it comes to querying, but you have to be aware of the caveats in terms of performance and security.

Even though most aggregation types allow you to use them, scripts slow down aggregations because they have to be run on every document. To avoid running a script, you can do the calculation at index time. In this case, you can extract the number of attendees for every event and add it to a separate field before indexing it. We’ll talk more about performance in chapter 10.

In most Elasticsearch deployments, the user specifies a query string, and it’s up to the server-side application to construct the query out of it. But if you allow users to specify any kind of query, including scripts, someone might exploit this and run malicious code. That’s why, depending on your Elasticsearch version, running scripts inline like in listing 7.3 (called dynamic scripting) is disabled. To enable it, set script.disable_dynamic: false in elasticsearch.yml.


In the following listing, you’ll request statistics on the number of attendees for all events. To get the number of attendees in the script, you’ll use doc['attendees'].values to get the array of attendees. Adding the length property to that will return their number.

Listing 7.3. Getting stats for the number of event attendees
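A sketch of this request, using the same attendee-counting script as listings 7.4 and 7.5; the reply values below match the extended_stats output shown in listing 7.5:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "attendees_stats": {
    "stats": {
      "script": "doc['"'attendees'"'].values.length"
    }
  }
}}'
### reply
[...]
  "aggregations" : {
    "attendees_stats" : {
      "count" : 15,
      "min" : 3.0,
      "max" : 5.0,
      "avg" : 3.8666666666666667,
      "sum" : 58.0
    }
  }
}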

You can see that you get back the minimum number of attendees per event, as well as the maximum, the sum, and the average. You also get the number of documents these statistics were computed on.

If you need only one of those statistics, you can get it separately. For example, you’ll calculate the average number of attendees per event through the avg aggregation in the next listing.

Listing 7.4. Getting the average number of event attendees
URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "attendees_avg": {
    "avg": {
      "script": "doc['"'attendees'"'].values.length"
    }
  }
}}'
### reply
[...]
  "aggregations" : {
    "attendees_avg" : {
      "value" : 3.8666666666666667
    }
  }
}

Similar to the avg aggregation, you can get the other metrics through the min, max, sum, and value_count aggregations. You’d have to replace avg from listing 7.4 with the needed aggregation name. The advantage of separate statistics is that Elasticsearch won’t spend time computing metrics that you don’t need.

7.2.2. Advanced statistics

In addition to statistics gathered by the stats aggregation, you can get the sum of squares, variance, and standard deviation of your numeric field by running the extended_stats aggregation, as shown in the next listing.

Listing 7.5. Getting extended statistics on the number of attendees
URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "attendees_extended_stats": {
    "extended_stats": {
      "script": "doc['"'attendees'"'].values.length"
    }
  }
}}'
### reply
  "aggregations" : {
    "attendees_extended_stats" : {
      "count" : 15,
      "min" : 3.0,
      "max" : 5.0,
      "avg" : 3.8666666666666667,
      "sum" : 58.0,
      "sum_of_squares" : 230.0,
      "variance" : 0.38222222222222135,
      "std_deviation" : 0.6182412330330462
    }
  }

All these statistics are calculated by looking at all the values in the document set matching the query, so they’re 100% accurate all the time. Next we’ll look at some statistics that use approximation algorithms, trading some of the accuracy for speed and less memory consumption.

7.2.3. Approximate statistics

Some statistics can be calculated with good precision—though not 100%—by looking at some of the values from your documents. This will limit both their execution time and their memory consumption.

Here we’ll look at how to get two types of such statistics from Elasticsearch: percentiles and cardinality. Percentiles are values below which you can find x% of the total values, where x is the given percentile. This is useful, for example, when you have an online shop: you log the value of each shopping cart and you want to see in which price range most shopping carts fall. Perhaps most of your users only buy an item or two, but the upper 10% buy a lot of items and generate most of your revenue.

Cardinality is the number of unique values in a field. This is useful, for example, when you want the number of unique IP addresses accessing your website.

Percentiles

For percentiles, think about the number of attendees for events once again and determine the maximum number of attendees you’ll consider normal and the number you’ll consider high. In listing 7.6, you’ll calculate the 80th percentile and the 99th. You’ll consider numbers under the 80th to be normal and numbers under the 99th high, and you’ll ignore the upper 1%, because they’re exceptionally high.

To accomplish this, you’ll use the percentiles aggregation, and you’ll set the percents array to 80 and 99 in order to get these specific percentiles.

Listing 7.6. Getting the 80th and the 99th percentiles from the number of attendees
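A sketch of this request, with the attendee-counting script from listing 7.4 and the percents array set to the two percentiles you care about:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "attendees_percentiles": {
    "percentiles": {
      "script": "doc['"'attendees'"'].values.length",
      "percents": [80, 99]
    }
  }
}}'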

For small data sets like the code samples, you have 100% accuracy, but this may not happen with large data sets in production. With the default settings, you have over 99.9% accuracy for most data sets for most percentiles. The specific percentile matters, because accuracy is at its worst for the 50th percentile, and as you go toward 0 or 100 it gets better and better.

You can trade memory for accuracy by increasing the compression parameter from the default 100. Memory consumption increases proportionally to the compression, which in turn controls how many values are taken into account when approximating percentiles.

There’s also a percentile_ranks aggregation that allows you to do the opposite—specify a set of values—and you’ll get back the corresponding percentage of documents having up to those values:

% curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "attendees_percentile_ranks": {
    "percentile_ranks": {
      "script": "doc['"'attendees'"'].values.length",
      "values": [4, 5]
    }
  }
}}'


Cardinality

For cardinality, let’s imagine you want the number of unique members of your get-together site. The following listing shows how to do that with the cardinality aggregation.

Listing 7.7. Getting the number of unique members through the cardinality aggregation
URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "members_cardinality": {
    "cardinality": {
      "field": "members"
    }
  }
}}'
### reply
  "aggregations" : {
    "members_cardinality" : {
      "value" : 8
    }
  }

Like the percentiles aggregation, the cardinality aggregation is approximate. To understand the benefit of such approximation algorithms, let’s take a closer look at the alternative. Before the cardinality aggregation was introduced in version 1.1.0, the common way to get the cardinality of a field was to run the terms aggregation you saw in section 7.1. Because the terms aggregation returns counts for the top N terms, where N is the configurable size parameter, specifying a large enough size could get you all the unique terms back. Counting them gives you the cardinality.

Unfortunately, this approach only works for fields with relatively low cardinality and a low number of documents. Otherwise, running a terms aggregation with a huge size requires a lot of resources:

  • Memory— All the unique terms need to be loaded in memory in order to be counted.
  • CPU— Those terms have to be returned in order; by default the order is on how many times each term occurs.
  • Network— From each shard, the large array of sorted unique terms has to be transferred to the node that received the client request. That node also has to merge per-shard arrays into one big array and transfer it back to the client.

This is where approximation algorithms come into play. The cardinality aggregation works with an algorithm called HyperLogLog++ that hashes values from the field you want to examine and uses the hashes to approximate the cardinality. It loads only some of those hashes into memory at once, so the memory usage will be constant no matter how many terms you have.


Note

For more details on the HyperLogLog++ algorithm, have a look at the original paper from Google: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40671.pdf.


Memory and cardinality

We said that the memory usage of the cardinality aggregation is constant, but how large would that constant be? You can configure it through the precision_threshold parameter. The higher the threshold, the more precise the results, but more memory is consumed. If you run the cardinality aggregation on its own, it will take about precision_threshold times 8 bytes of memory for each shard that gets hit by the query.
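For example, here’s a sketch of how you might raise the threshold on the cardinality aggregation from listing 7.7; the value of 1,000 is an arbitrary illustrative choice:

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "members_cardinality": {
    "cardinality": {
      "field": "members",
      "precision_threshold": 1000
    }
  }
}}'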

The cardinality aggregation, like all other aggregations, can be nested under a bucket aggregation. When that happens, the memory usage is further multiplied by the number of buckets generated by the parent aggregations.


Tip

For most cases, the default precision_threshold will work well, because it provides a good tradeoff between memory usage and accuracy, and it adjusts itself depending on the number of buckets.


Next, we’ll look at the available multi-bucket aggregations. But before we go there, table 7.1 gives you a quick overview of each metrics aggregation and the typical use case.

Table 7.1. Metrics aggregations and typical use cases

  • stats— Same product sold in multiple stores. Gather statistics on the price: how many stores have it and what the minimum, maximum, and average prices are.
  • individual stats (min, max, sum, avg, value_count)— Same product sold in multiple stores. Show “prices starting from” and then the minimum price.
  • extended_stats— Documents contain results from a personality test. Gather statistics from that group of people, such as the variance and the standard deviation.
  • percentiles— Access times on your website: what the usual delays are and how long the longest response times are.
  • percentile_ranks— Checking if you meet SLAs: if 99% of requests have to be served under 100ms, you can check what the actual percentage is.
  • cardinality— Number of unique IP addresses accessing your service.

7.3. Multi-bucket aggregations

As you saw in the previous section, metrics aggregations are about taking all your documents and generating one or more numbers that describe them. Multi-bucket aggregations are about taking those documents and putting them into buckets—like the group of documents matching each tag. Then, for each bucket, you’ll get one or more numbers that describe the bucket, such as counting the number of groups for each tag.

So far you’ve run metrics aggregations on all documents matching the query. You can think of those documents as one big bucket. Other aggregations generate such buckets: for example, if you’re indexing logs and have a country code field, you can do a terms aggregation on it to create one bucket of documents for each country. As you’ll see in section 7.4, you can nest aggregations: for example, a cardinality aggregation could run on the buckets created by the terms aggregation to give you the number of unique visitors per country.

For now, let’s see what kinds of multi-bucket aggregations are available and where they’re typically useful:

  • Terms aggregations let you figure out the frequency of each term in your documents. There’s the terms aggregation, which you’ve seen a couple of times already, that gives you back the number of times each term appears. It’s useful for figuring out things like frequent posters on a blog or popular tags. There’s also the significant_terms aggregation, which gives you back the difference between the occurrence of a term in the whole index and its occurrence in your query results. This is useful for suggesting terms that are significant for the search context, like “elasticsearch” would be for the context of “search engine.”
  • Range aggregations create buckets based on how documents fall into which numerical, date, or IP address range. This is useful when analyzing data where the user has fixed expectations. For example, if someone is searching for a laptop in an online shop, you know the price ranges that are most popular.
  • Histogram aggregations, either numerical or date, are similar to range aggregations, but instead of requiring you to define each range, you have to define an interval, and Elasticsearch will build buckets based on that interval. This is useful when you don’t know where the user is likely to look. For example, you could show a chart of how many events occur each month.
  • Nested, reverse nested, and children aggregations allow you to perform aggregations across document relationships. We’ll discuss them in chapter 8 when we talk about nested and parent-child relations.
  • Geo distance and geohash grid aggregations allow you to create buckets based on geolocation. We’ll show them in appendix A, which is focused on geo search.

Figure 7.5 shows an overview of the types of multi-bucket aggregations we’ll discuss here.

Figure 7.5. Major types of multi-bucket aggregations

Next, let’s zoom into each of these multi-bucket aggregations and see how you can use them.

7.3.1. Terms aggregations

We first looked at the terms aggregation in section 7.1 as an example of how all aggregations work. The typical use case is to get the top frequent X, where X would be a field in your document, like the name of a user, a tag, or a category. Because the terms aggregation counts every term and not every field value, you’ll normally run this aggregation on a non-analyzed field, because you want “big data” to be counted once and not once for “big” and once for “data.”

You could use the terms aggregation to extract the most frequent terms from an analyzed field, like the description of an event. You can use this information to generate a word cloud, like the one in figure 7.6. Just make sure you have enough memory for loading all those terms into field data if you have many documents or the documents contain many terms.

Figure 7.6. A terms aggregation can be used to get term frequencies and generate a word cloud.

By default, the order of terms is by their count, descending, which fits all the top frequent X use cases. But you can order terms ascending, or by other criteria, such as the term name itself. The following listing shows how to list the group tags ordered alphabetically by using the order property.

Listing 7.8. Ordering tag buckets by name
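A sketch of this request, with the order property sorting on the term itself, ascending:

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "tags": {
    "terms": {
      "field": "tags.verbatim",
      "order": {
        "_term": "asc"
      }
    }
  }
}}'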

If you’re nesting a metric aggregation under your terms aggregation, you can order terms by the metric, too. For example, you could use the avg metric aggregation under your tags aggregation from listing 7.8 to get the average number of group members per tag. You could then order tags by the number of members by referring to your metric aggregation by name, like avg_members: desc (instead of _term: asc as in listing 7.8).

Which terms to include in the reply

By default, the terms aggregation will return only the top 10 terms by the order you selected. You can, however, change that number through the size parameter. Setting size to 0 will get you all the terms, but it’s dangerous to use with a high-cardinality field, because returning a very large result is CPU-intensive to sort and might saturate your network.

To get back the top 10 terms—or the number of terms you configure with size—Elasticsearch has to get a number of terms (configurable through shard_size) from each shard and aggregate the results. The process is shown in figure 7.7, with shard_size and size set to 2 for clarity.

Figure 7.7. Sometimes the overall top X is inaccurate, because only the top X terms are returned per shard.

This mechanism implies that you might get inaccurate counters for some terms if those terms don’t make it to the top of each individual shard. This can even result in missing terms, like in figure 7.7 where lucene, with a total value of 7, isn’t returned in the top 2 overall tags because it didn’t make the top 2 for each shard.

You can get more accurate results by setting a large shard_size, as shown in figure 7.8. But this will make aggregations more expensive (especially if you nest them) because there are more buckets that need to be kept in memory.
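For example, here’s a sketch of a request asking for the top 5 tags while fetching the top 25 terms from each shard; both numbers are arbitrary illustrative values:

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "tags": {
    "terms": {
      "field": "tags.verbatim",
      "size": 5,
      "shard_size": 25
    }
  }
}}'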

Figure 7.8. Reducing inaccuracies by increasing shard_size

To get an idea of how accurate results are, you can check the values at the beginning of the aggregation response:

"tags" : {
  "doc_count_error_upper_bound" : 0,
  "sum_other_doc_count" : 6,

The first number is the worst-case scenario error margin. For example, if the minimum count for a term returned by a shard is 5, it could be that a term occurring four times in that shard has been missed. If that term should have appeared in the final results, that’s a worst-case error of 4. The total of these numbers for all shards makes up doc_count_error_upper_bound. For our code samples, that number is always 0, because we have only one shard—the top terms for that shard are the same as the global top terms.

The second number is the total count of the terms that didn’t make the top.

You can get a doc_count_error_upper_bound value for each term by setting show_term_doc_count_error to true. This will take the worst-case scenario error per term: for example, if “big data” is returned by a shard, you know that its count is exact. But if another shard doesn’t return “big data” at all, the worst-case scenario is that “big data” actually exists there with a count just below the last returned term. Adding up these error numbers for the shards not returning that term makes up the doc_count_error_upper_bound for that term.

At the other end of the accuracy spectrum, you could consider terms with low frequency irrelevant and exclude them from the result set entirely. This is especially useful when you sort terms by something other than frequency, which makes it likely that low-frequency terms will appear, but you don’t want to pollute the results with irrelevant results like typos. To do that, you’ll need to change the min_doc_count setting from the default value of 1. If you want to cut these low-frequency terms at the shard level, you use shard_min_doc_count.
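For example, here’s a sketch that only returns tags occurring in at least two groups:

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "tags": {
    "terms": {
      "field": "tags.verbatim",
      "min_doc_count": 2
    }
  }
}}'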

Finally, you can include and exclude specific terms from the result. You’d do that by using the include and exclude options and providing regular expressions as values. Using include alone will include only terms matching the pattern; using exclude alone will include terms that don’t match. Using both will have exclude take precedence: included terms will match the include pattern but won’t match the exclude pattern.

The following listing shows how to only return counters for tags containing “search.”

Listing 7.9. Creating buckets only for terms containing “search”
URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "tags": {
    "terms": {
      "field": "tags.verbatim",
      "include": ".*search.*"

    }
  }
}}'
### reply
  "aggregations" : {
    "tags" : {
      "buckets" : [ {
        "key" : "elasticsearch",
        "doc_count" : 2
      }, {
        "key" : "enterprise search",
        "doc_count" : 1



Collect mode

By default, Elasticsearch does all aggregations in a single pass. For example, if you had a terms aggregation and a cardinality aggregation nested in it, Elasticsearch would make a bucket for each term, calculate the cardinality for each bucket, sort those buckets, and return the top X.

This works well for most use cases, but it will take lots of time and memory if you have lots of buckets and lots of sub-aggregations, especially if a sub-aggregation is also a multi-bucket aggregation with lots of buckets. In such cases, a two-pass approach will be better: first create the buckets of the top-level aggregation, sort and cache the top X, and then calculate sub-aggregations on only those top X.

You can control which approach Elasticsearch uses by setting collect_mode. The default is depth_first, and the two-pass approach is breadth_first.
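For example, here’s a sketch that switches a terms aggregation with a nested cardinality aggregation to the two-pass approach:

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "tags": {
    "terms": {
      "field": "tags.verbatim",
      "collect_mode": "breadth_first"
    },
    "aggregations": {
      "unique_members": {
        "cardinality": {
          "field": "members"
        }
      }
    }
  }
}}'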


Significant terms

The significant_terms aggregation is useful if you want to see which terms have higher frequencies than normal in your current search results. Let’s take the example of get-together groups: in all the groups out there, the term clojure may not appear frequently enough to count. Let’s assume that it appears 10 times out of 1,000,000 terms (0.001%). If you restrict your search for Denver, let’s say it appears 7 times out of 10,000 terms (0.07%). The percentage is significantly higher than before and indicates a strong Clojure community in Denver, compared to the rest of the search area. It doesn’t matter that other terms such as programming or devops have a much higher absolute frequency.

The significant_terms aggregation is much like the terms aggregation in the sense that it’s counting terms. But the resulting buckets are ordered by a score, which represents the difference in percentage between the foreground documents (that 0.07% in the previous example) and the background documents (0.001%). The foreground documents are those matching your query, and the background documents are all the documents from the index.

In the following listing, you’ll try to find out which users of the get-together site have a similar preference to Lee for events. To do that, you’ll query for events where Lee attends and use the significant_terms aggregation to see which event attendees participate in more, compared to the overall set of events they attend.

Listing 7.10. Finding attendees attending similar events to Lee
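A sketch of this request; the min_doc_count and exclude settings are illustrative choices that skip one-off overlaps and keep Lee himself out of the buckets:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"query": {
  "match": {
    "attendees": "lee"
  }
},
"aggregations": {
  "similar_attendees": {
    "significant_terms": {
      "field": "attendees",
      "min_doc_count": 2,
      "exclude": "lee"
    }
  }
}}'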

As you might have guessed from the listing, the significant_terms aggregation has the same size, shard_size, min_doc_count, shard_min_doc_count, include, and exclude options as the terms aggregation, which lets you control the terms you get back. In addition to those, it allows you to change the background documents from all the documents in the index to only those matching a defined filter in the background_filter parameter. For example, you may know that Lee participates only in technology events, so you can filter those to make sure that events irrelevant to him aren’t taken into account.

Both the terms and significant_terms aggregations work well for string fields. For numeric fields, range and histogram aggregations are more relevant, and we’ll look at them next.

7.3.2. Range aggregations

The terms aggregation is most often used with strings, but it works with numeric values, too. This is useful when you have low cardinality, like when you want to give counts on how many laptops have two years of warranty, how many have three, and so on.

With high-cardinality fields, such as ages or prices, you’re most likely looking for ranges. For example, you may want to know how many of your users are between 18 and 39, how many are between 40 and 60, and so on. You can still do that with the terms aggregation, but it’s going to be tedious: in your application, you’d have to add up counters for ages 18, 19, and so on until you get to 39 to get the first bucket. And if you want to add sub-aggregations, like the ones you’ll see later in this chapter, things will get even more complicated.

To solve this problem for numerical values, you have the range aggregation. As the name suggests, you give the numerical ranges you want, and it will count the documents with values that fall into each bucket. You can use those counters to represent the data in a graphical way—for example, with a pie chart, as shown in figure 7.9.

Figure 7.9. range aggregations give you counts of documents for each range. This is good for pie charts.

Recall from chapter 3 that date strings are stored as type long in Elasticsearch, representing the UNIX time in milliseconds. To work with date ranges, you have a variant of the range aggregation called the date_range aggregation.

Range aggregation

Let’s get back to our get-together site example and do a breakdown of events by their number of attendees. You’ll do it with the range aggregation and give it an array of ranges. The thing to keep in mind here is that the minimum value from the range (the key from) is included in the bucket, whereas the maximum value (to) is excluded. In listing 7.11, you’ll have three categories:

  • Events with fewer than four members
  • Events with at least four members but fewer than six
  • Events with at least six members

Note

Ranges don’t have to be adjacent; they can be separated or they can overlap. In most cases it makes sense to cover all values, but you don’t need to.


Listing 7.11. Using a range aggregation to divide events by the number of attendees
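A sketch of this request, counting attendees with the same script as in listing 7.4; note how the first and last ranges omit from and to, respectively:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "attendees_breakdown": {
    "range": {
      "script": "doc['"'attendees'"'].values.length",
      "ranges": [
        { "to": 4 },
        { "from": 4, "to": 6 },
        { "from": 6 }
      ]
    }
  }
}}'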

You can see from the listing that you don’t have to specify both from and to for every range in the aggregation. Omitting one of these parameters will remove the respective boundary, and this enables you to search for all events with fewer than four members or with at least six.

Date range aggregation

As you might imagine, the date_range aggregation works just like the range aggregation, except you put date strings in your range definitions. And because of that, you should define the date format so Elasticsearch will know how to translate the string you give it into the numerical UNIX time, which is how date fields are stored.

In the following listing, you’ll divide events into two categories: before July 2013 and starting with July 2013. You can use a similar approach to count future events and past events, for example.

Listing 7.12. Using a date range aggregation to divide events by scheduled date
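A sketch of this request on the event date field; the yyyy-MM format is an illustrative choice, and any Joda-Time pattern matching your range strings would do:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "dates_breakdown": {
    "date_range": {
      "field": "date",
      "format": "yyyy-MM",
      "ranges": [
        { "to": "2013-07" },
        { "from": "2013-07" }
      ]
    }
  }
}}'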

If the value of the format field looks familiar, it’s because it’s the same Joda-Time notation that you saw in chapter 3 when you defined date formats in the mapping. For the complete syntax, you can look at the DateTimeFormat documentation: http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html.

7.3.3. Histogram aggregations

For dealing with numeric ranges, you also have histogram aggregations. These are much like the range aggregations you just saw, but instead of manually defining each range, you’d define a fixed interval, and Elasticsearch would build the ranges for you. For example, if you want age groups from people documents, you can define an interval of 10 (years) and you’ll get buckets like 0–10 (excluding 10), 10–20 (excluding 20), and so on.

Like the range aggregation, the histogram aggregation has a variant that works with dates, called the date_histogram aggregation. This is useful, for example, when building histogram charts of how many emails were sent on a mailing list each day.

Histogram aggregation

Running a histogram aggregation is similar to running a range aggregation. You just replace the ranges array with an interval, and Elasticsearch will build ranges starting with the minimum value, adding the interval until the maximum value is included. For example, in the following listing, you specify an interval of 1 and show how many events have three attendees, how many have four, and how many have five.

Listing 7.13. Histogram showing the number of events for each number of attendees
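A sketch of this request, once again counting attendees through a script:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "attendees_histogram": {
    "histogram": {
      "script": "doc['"'attendees'"'].values.length",
      "interval": 1
    }
  }
}}'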

Like the terms aggregation, the histogram aggregation lets you specify a min_doc_count value, which is helpful if you want buckets with few documents to be ignored. min_doc_count is also useful if you want to show empty buckets. By default, if there’s an interval between the minimum and maximum values that has no documents, that interval will be omitted altogether. Set min_doc_count to 0 and those intervals will still appear with a document count of 0.

Date histogram aggregation

As you might expect, you’d use the date_histogram aggregation like the histogram one, but you’d put a time interval in the interval field. That interval is specified in the same Joda-Time notation as the date_range aggregation, with values such as 1M or 1.5h. For example, the following listing gives the breakdown of events happening in each month.

Listing 7.14. Histogram of events per month
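A sketch of this request; the same date_histogram appears again in the missing aggregation example of section 7.4.3:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "event_dates": {
    "date_histogram": {
      "field": "date",
      "interval": "1M"
    }
  }
}}'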

Like the regular histogram aggregation, you can use the min_doc_count option to either show empty buckets or omit buckets containing just a few documents.

You probably noticed that the date_histogram aggregation has two things in common with all the other multi-bucket aggregations:

  • It counts documents having certain terms.
  • It creates buckets of documents falling into each category.

The buckets themselves are useful only when you nest other aggregations under a multi-bucket aggregation. This allows you to have deeper insights into your data, and we’ll look at nesting aggregations in the next section. First, take time to look at table 7.2, which gives you a quick overview of the multi-bucket aggregations and what they’re typically used for.

Table 7.2. Multi-bucket aggregations and typical use cases

  • terms— Show top tags on a blogging site; hot topics this week on a news site.
  • significant_terms— Identify new technology trends by looking at what’s used/downloaded a lot this month compared to overall.
  • range and date_range— Show entry-level, medium-priced, and expensive laptops. Show archived events, events this week, upcoming events.
  • histogram and date_histogram— Show distributions: how much people of each age exercise. Or show trends: items bought each day.

The list isn’t exhaustive, but it does include the most important aggregation types and their options. You can check the documentation[1] for a complete list. Also, geo aggregations are dealt with in appendix A, and nested and children aggregations in chapter 8.


7.4. Nesting aggregations

The real power of aggregations is the fact that you can combine them. For example, if you have a blog and you record each access to your posts, you can use the terms aggregation to show the most-viewed posts. But you can also nest a cardinality aggregation under this terms aggregation and show the number of unique visitors for each post; you can even change the sorting in the terms aggregation to show posts with the most unique visitors.

As you may imagine, nesting aggregations opens a whole new range of possibilities for exploring data. Nesting is the main reason aggregations emerged in Elasticsearch as a replacement for facets, because facets couldn’t be combined.

Multi-bucket aggregations are typically the point where you start nesting. For example, the terms aggregation allows you to show the top tags for get-together groups; this means you’ll have a bucket of documents for each tag. You can use sub-aggregations to show more metrics for each bucket. For example, you can show how many groups are being created each month, for each tag, as illustrated in figure 7.10.

Figure 7.10. Nesting a date histogram aggregation under a terms aggregation

Later in this section, we’ll discuss one particular use case for nesting: result grouping, which, unlike a regular search that gives you the top N results by relevance, gives you the top N results for each bucket of documents generated by the parent aggregation. Say you have an online shop and someone searches for “Windows.” Normally, relevance-sorted results will show many versions of the Windows operating system first. This may not be the best user experience, because at this point it’s not 100% clear whether the user is looking to buy a Windows operating system, some software built for Windows, or some hardware that works with Windows. This is where result grouping, illustrated in figure 7.11, comes in handy: you can show the top three results from each of the operating systems, software, and hardware categories and give the user a broader range of results. The user may also want to click on the category name to narrow the search to that category only.

Figure 7.11. Nesting the top_hits aggregation under a terms aggregation to get result grouping

In Elasticsearch, you’ll be able to get result grouping by using a special aggregation called top_hits. It retrieves the top N results, sorted by score or a criterion of your choice, for each bucket of a parent aggregation. That parent aggregation can be a terms aggregation that’s running on the category field, as suggested in the online shop example of figure 7.11; we’ll go over this special aggregation in the next section.

The last nesting use case we’ll talk about is controlling the document set on which your aggregations run. For example, regardless of the query, you might want to show the top tags for get-together groups created in the last year. To do this, you’d use the filter aggregation, which creates a bucket of documents that match the provided filter, in which you can nest other aggregations.

7.4.1. Nesting multi-bucket aggregations

To nest an aggregation within another one, you just have to use the aggregations or aggs key on the same level as the parent aggregation type and then put the sub-aggregation definition as the value. For multi-bucket aggregations, this can be done indefinitely. For example, in the following listing you’ll use the terms aggregation to show the top tags. For each tag, you’ll use the date_histogram aggregation to show how many groups were created each month, for each tag. Finally, for each bucket of such groups, you’ll use the range aggregation to show how many groups have fewer than three members and how many have at least three.

Listing 7.15. Nesting multi-bucket aggregations three times
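A sketch of this triple nesting; the created_on date field for groups is an assumption about the sample dataset, and member counts come from a script as before:

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "top_tags": {
    "terms": {
      "field": "tags.verbatim"
    },
    "aggregations": {
      "groups_per_month": {
        "date_histogram": {
          "field": "created_on",
          "interval": "1M"
        },
        "aggregations": {
          "number_of_members": {
            "range": {
              "script": "doc['"'members'"'].values.length",
              "ranges": [
                { "to": 3 },
                { "from": 3 }
              ]
            }
          }
        }
      }
    }
  }
}}'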

You can always nest a metrics aggregation within a bucket aggregation. For example, if you wanted the average number of group members instead of the 0–2 and 3+ ranges that you had in the previous listing, you could use the avg or stats aggregation.

One particular type of aggregation we promised to cover in the last section is top_hits. It will get you the top N results, sorted by the criteria you like, for each bucket of its parent aggregation. Next, we’ll look at how you’ll use the top_hits aggregation to get result grouping.

7.4.2. Nesting aggregations to get result grouping

Result grouping is useful when you want to show the top results grouped by a certain category. Like in Google, when you have many results from the same site, you sometimes see only the top three or so, and then it moves on to the next site. You can always click the site’s name to get all the results from it that match your query.

That’s what result grouping is for: it allows you to give the user a better idea of what else is in there. Say you want to show the user the most recent events, and to make results more diverse you’ll show the most recent event for the most frequent attendees. You’ll do this in the next listing by running the terms aggregation on the attendees field and nesting the top_hits aggregation under it.

Listing 7.16. Using the top hits aggregation to get result grouping
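A sketch of this request; sorting the top_hits by date, descending, with a size of 1 retrieves the most recent event per attendee:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "frequent_attendees": {
    "terms": {
      "field": "attendees"
    },
    "aggregations": {
      "recent_events": {
        "top_hits": {
          "sort": {
            "date": "desc"
          },
          "size": 1
        }
      }
    }
  }
}}'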

At first, it may seem strange to use aggregations for getting results grouping. But now that you’ve learned what aggregations are all about, you can see that these concepts of buckets and nesting are powerful and enable you to do much more than gather some statistics on query results. The top_hits aggregation is an example of a non-statistic outcome of aggregations.

You’re not limited to only query results when you run aggregations; this is the default behavior, as you learned in section 7.1, but you can work around that if you need to. For example, let’s say that you want to show the most popular blog post tags on your blog somewhere on a sidebar. And you want to show that sidebar no matter what the user is searching for. To achieve this, you’d need to run your terms aggregation on all blog posts, independent of your query. Here’s where the global aggregation becomes useful: it produces a bucket with all the documents of your search context (the indices and types you’re searching in), making all other aggregations nested under it work with all these documents.

The global aggregation is one of the single-bucket aggregations that you can use to change the document set other aggregations run on, and that’s what we’ll explore next.

7.4.3. Using single-bucket aggregations

As you saw in section 7.1, Elasticsearch will run your aggregations on the query results by default. If you want to change this default, you’ll have to use single-bucket aggregations. Here we’ll discuss three of them:

  • global creates a bucket with all the documents of the indices and types you’re searching on. This is useful when you want to run aggregations on all documents, no matter the query.
  • filter and filters aggregations create buckets with all the documents matching one or more filters. This is useful when you want to further restrict the document set—for example, to run aggregations only on items that are in stock, or separate aggregations for those in stock and those that are promoted.
  • missing creates a bucket with documents that don’t have a specified field. It’s useful when you have another aggregation running on a field, but you want to do some computations on documents that aren’t covered by that aggregation because the field is missing. For example, you want to show the average price of items across multiple stores and also want to show the number of stores not listing a price for those items.

Global

Using your get-together site from the code samples, assume you’re querying for events about Elasticsearch, but you want to see the most frequent tags overall. For example, as we described earlier, you want to show those top tags somewhere on a sidebar, independent of what the user is searching for. To achieve this, you need to use the global aggregation, which can alter the flow of data from query to aggregations, as shown in figure 7.12.

Figure 7.12. Nesting aggregations under the global aggregation makes them run on all documents.

In the following listing you’ll nest the terms aggregation under the global aggregation to get the most frequent tags on all documents, even if the query looks for only those with “elasticsearch” in the title.

Listing 7.17. Global aggregation helps show top tags overall regardless of the query
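A sketch of this request; matching “elasticsearch” against the group’s name field is an assumption about where the title lives in the sample dataset:

URI=localhost:9200/get-together/group/_search
curl "$URI?pretty&search_type=count" -d '{
"query": {
  "match": {
    "name": "elasticsearch"
  }
},
"aggregations": {
  "all_documents": {
    "global": {},
    "aggregations": {
      "top_tags": {
        "terms": {
          "field": "tags.verbatim"
        }
      }
    }
  }
}}'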

When we say “all documents,” we mean all the documents from the search context defined in the search URI. In this case you’re searching in the group type of the get-together index, so all the groups will be taken into account. If you searched in the whole get-together index, both groups and events would be included in the aggregation.

Filter

Remember the post filter from section 7.1? It’s used when you define a filter directly in the JSON request, instead of wrapping it in a filtered query; the post filter restricts the results you get without affecting the aggregations.

The filter aggregation does the opposite: it restricts the document set your aggregations run on, without affecting the results. This is illustrated in figure 7.13.

Figure 7.13. The filter aggregation restricts query results for aggregations nested under it.

If you’re searching for events with “elasticsearch” in the title, you want to create a word cloud from words within the description, but you want to only account for documents that are recent enough—let’s say after July 1, 2013.

To do that, in the following listing you’d run a query as usual, but with aggregations. You’ll first have a filter aggregation restricting the document set to those after July 1, and under it you’ll nest the terms aggregation that generates the word-cloud information.

Listing 7.18. filter aggregation restricts the document set coming from the query
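A sketch of this request on events; the exact timestamp format in the range filter is an assumption:

URI=localhost:9200/get-together/event/_search
curl "$URI?pretty&search_type=count" -d '{
"query": {
  "match": {
    "title": "elasticsearch"
  }
},
"aggregations": {
  "recent_events": {
    "filter": {
      "range": {
        "date": {
          "gt": "2013-07-01T00:00"
        }
      }
    },
    "aggregations": {
      "description_cloud": {
        "terms": {
          "field": "description"
        }
      }
    }
  }
}}'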

Note

There’s also a filters (plural) aggregation, which allows you to define multiple filters. It works similarly to the filter aggregation, except that it generates multiple buckets, one for each filter—like the range aggregation generates multiple buckets, one for each range. For more information about the filters aggregation, go to www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filters-aggregation.html.


Missing

Most of the aggregations we’ve looked at so far make buckets of documents and get metrics from values of a field. If a document is missing that field, it won’t be part of the bucket and it won’t contribute to any metrics.

For example, you might have a date_histogram aggregation on event dates, but some events have no date set yet. You can count them, too, through the missing aggregation:

% curl "$URI?pretty&search_type=count" -d '{
"aggregations": {
  "event_dates": {
    "date_histogram": {
      "field": "date",
      "interval": "1M"
    }
  },
  "missing_date": {
    "missing": {
      "field": "date"
    }
  }
}}'



As with other single-bucket aggregations, the missing aggregation allows you to nest other aggregations under it. For example, you can use the max aggregation to show the maximum number of people who intend to participate in a single event that doesn’t have a date set yet.

There are other important single-bucket aggregations that we didn’t cover here, like the nested and reverse_nested aggregations, which allow you to use all the power of aggregations with nested documents.

Using nested documents is one of the ways to work with relational data in Elasticsearch. The next chapter provides all you need to know about relations among documents, including nested documents and nested aggregations.


7.5. Summary

In this chapter, we covered the major aggregation types and how you can combine them to get insights about documents matching a query:

  • Aggregations help you get an overall view of query results by counting terms and computing statistics from resulting documents.
  • Aggregations are the new facets in Elasticsearch: there are more types of aggregations, and you can also combine them to get deeper insights into the data.
  • There are two main types of aggregations: bucket and metrics.
  • Metrics aggregations calculate statistics over a set of documents, such as the minimum, maximum, or average value of a numeric field.
  • Some metrics aggregations are calculated with approximation algorithms, which allows them to scale a lot better than exact metrics. The percentiles and cardinality aggregations work like this.
  • Bucket aggregations put documents into one or more buckets and return counters for those buckets—for example, the most frequent posters in a forum. You can nest sub-aggregations under bucket aggregations, making these sub-aggregations run one time for each bucket generated by the parent. You can use this nesting, for example, to get the average number of comments for blog posts matching each tag.
  • The top_hits aggregation can be used as a sub-aggregation to implement result grouping.
  • The terms aggregation is typically used for top frequent users/locations/items/... kinds of use cases. Other multi-bucket aggregations are variations of the terms aggregation, such as the significant_terms aggregation, which returns those words that appear more often in the query results than in the overall index.
  • The range and date_range aggregations are useful for categorizing numeric and date fields. The histogram and date_histogram aggregations are similar, but they use fixed intervals instead of manually defined ranges.
  • Single-bucket aggregations, such as the global, filter, filters, and missing aggregations, are used to change the document set on which other aggregations run, which defaults to the documents returned by the query.