
Elasticsearch 7 Aggregations


An aggregation summarizes your data as metrics, statistics, or other analytics. Aggregations help you answer questions like:

  • What’s the average load time for my website?
  • Who are my most valuable customers based on transaction volume?
  • What would be considered a large file on my network?
  • How many products are in each product category?

Elasticsearch organizes aggregations into three categories:

  • Metric aggregations that calculate metrics, such as a sum or average, from field values.
  • Bucket aggregations that group documents into buckets, also called bins, based on field values, ranges, or other criteria.
  • Pipeline aggregations that take input from other aggregations instead of documents or fields.
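
To make the three categories concrete, here is a minimal Python sketch of a search body that uses one of each. The field names (category, price) and the aggregation names are made up for illustration; terms, avg and avg_bucket are existing Elasticsearch aggregation types.

```python
import json

# Hypothetical fields ("category", "price") and aggregation names, for illustration only.
body = {
    "size": 0,  # return only aggregation results, no search hits
    "aggs": {
        # Bucket aggregation: one bucket per distinct category value.
        "by-category": {
            "terms": {"field": "category"},
            "aggs": {
                # Metric aggregation: average price inside each bucket.
                "avg-price": {"avg": {"field": "price"}}
            },
        },
        # Pipeline aggregation: averages the per-bucket averages above.
        "avg-of-category-averages": {
            "avg_bucket": {"buckets_path": "by-category>avg-price"}
        },
    },
}

request_json = json.dumps(body, indent=2)
print(request_json)
```

The printed JSON could be sent as the body of a _search request against an index that actually has such fields.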

Run an Aggregation

You can run aggregations as part of a search by specifying the search API's aggs parameter. The following search runs a terms aggregation on the rating field, e.g. a product rating:

GET /my-index/_search
{
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "rating"
      }
    }
  }
}

Aggregation results are in the response’s aggregations object:

"aggregations": {
  "my-agg-name": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": []
  }
}
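
Conceptually, a terms aggregation groups documents by the value of a field and counts the documents in each group. The following Python sketch imitates that on a handful of made-up documents. It is only an illustration of the idea, not how Elasticsearch implements it: it ignores sharding (which is where doc_count_error_upper_bound comes from), and the tie-breaking by ascending key is an assumption of this sketch.

```python
from collections import Counter

def terms_aggregation(docs, field, size=10):
    """Group docs by the value of `field` and count them, like a terms aggregation."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Most frequent values first; ties are broken by ascending key (assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ordered[:size]
    return {
        # Documents that fell into buckets beyond the requested size.
        "sum_other_doc_count": sum(n for _, n in ordered[size:]),
        "buckets": [{"key": k, "doc_count": n} for k, n in top],
    }

# Made-up sample documents.
docs = [{"rating": 5}, {"rating": 4}, {"rating": 5},
        {"rating": 3}, {"rating": 5}, {"rating": 4}]
result = terms_aggregation(docs, "rating")
print(result)
```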

You can also filter your results by adding a query. For example, what is the average rating for a single product:

curl -H "Content-Type: application/json" -XGET '127.0.0.1:9200/ratings/_search?size=0&pretty' -d '
{
  "query": {
    "match": {
      "title": "My Product Title"
    }
  },
  "aggs": {
    "my-agg-name": {
      "avg": {
        "field": "rating"
      }
    }
  }
}'

TV Ratings Example

As an example I am going to use my The Expanse TV series index, which has an imdb-rating field that I can run aggregation queries against. In that index we joined a number of episodes that all belong to season 01 of the show to their parent season, and we were able to query all of them with:

GET /the-expanse/_search
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  }
}


Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [imdb-rating] in order to load field data by uninverting the inverted index. Note that this can use significant memory.

At first I received the error message above because I had not defined an explicit mapping for this index, which meant that the imdb-rating field was mapped as text. I fixed this in the original article linked above and was then able to run the aggregations that follow.
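
For reference, the fix boils down to declaring the field with a numeric type before indexing the data. A minimal sketch of such a mapping, written as a Python dict that serialises to a request body for creating the index (e.g. PUT /the-expanse). Only the rating field is shown; whatever else the real index defines is left out here.

```python
import json

# Sketch of an explicit mapping; only the rating field is shown.
# Other fields of the index (titles, the parent/child relation) are omitted.
mapping = {
    "mappings": {
        "properties": {
            "imdb-rating": {"type": "float"}  # numeric, so it can be aggregated
        }
    }
}

mapping_json = json.dumps(mapping, indent=2)
print(mapping_json)
```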

What is the Rating Distribution for Season 01?

curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  },
  "aggs": {
    "imdb-ratings": {
      "terms": {
        "field": "imdb-rating"
      }
    }
  }
}'

I set the type of the imdb-rating field to float, which gave me these results:

"aggregations" : {
  "imdb-ratings" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 8.699999809265137,
        "doc_count" : 4
      },
      {
        "key" : 7.699999809265137,
        "doc_count" : 2
      },
      {
        "key" : 8.0,
        "doc_count" : 2
      },
      {
        "key" : 7.800000190734863,
        "doc_count" : 1
      },
      {
        "key" : 7.900000095367432,
        "doc_count" : 1
      }
    ]
  }
}
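
The odd-looking keys such as 8.699999809265137 are what you get when a decimal rating is stored as a 32-bit float and then printed as a 64-bit double. The Python sketch below reproduces the effect with the standard struct module, assuming the indexed ratings were the one-decimal values 8.7, 7.7, 8.0, 7.8 and 7.9.

```python
import struct

def as_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float and widen it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# Assumed original ratings; each one is squeezed through 32-bit precision.
for rating in (8.7, 7.7, 8.0, 7.8, 7.9):
    print(rating, "->", as_float32(rating))
```

A value like 8.0 survives unchanged because it can be represented exactly in binary, while 8.7 cannot.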

For whole-star buckets it is better to map imdb-rating as "type": "integer" (as an alternative, see the histogram example below). The integer mapping truncates the decimals, which gives six 8-star and four 7-star ratings for season 01 of The Expanse:

"aggregations" : {
  "imdb-ratings" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 8,
        "doc_count" : 6
      },
      {
        "key" : 7,
        "doc_count" : 4
      }
    ]
  }
}

How many 8 Star Ratings did Season 01 get?

This time we just need to filter for documents whose imdb-rating has the integer value 8:

curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "match": {
      "imdb-rating": 8
    }
  },
  "aggs": {
    "imdb-ratings": {
      "terms": {
        "field": "imdb-rating"
      }
    }
  }
}'

And we get the expected result. Six episodes of season 01 of The Expanse have an IMDb rating that truncates to 8:

"aggregations" : {
  "imdb-ratings" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 8,
        "doc_count" : 6
      }
    ]
  }
}

What is the Average Rating of all Episodes in Season 01?

curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  },
  "aggs": {
    "average-ratings": {
      "avg": {
        "field": "imdb-rating"
      }
    }
  }
}'

On the integer-mapped ratings (6 × 8 plus 4 × 7 over 10 episodes), the average rating for season 01 of The Expanse is 7.6:

"aggregations" : {
  "average-ratings" : {
    "value" : 7.6
  }
}
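
The 7.6 matches the truncated integer ratings, not the original decimal ones. A small Python check using the bucket counts from the terms aggregation results shown earlier makes the difference visible. Reading the float keys as one-decimal ratings (8.7, 7.7 and so on) is an assumption.

```python
# (rating, episode count) pairs from the float terms aggregation result,
# with the keys read as the one-decimal ratings they presumably were indexed as.
float_buckets = [(8.7, 4), (7.7, 2), (8.0, 2), (7.8, 1), (7.9, 1)]

def weighted_average(buckets):
    """Average of the bucket keys, weighted by the number of episodes per key."""
    total = sum(key * count for key, count in buckets)
    episodes = sum(count for _, count in buckets)
    return total / episodes

# Integer mapping: the fractional part is dropped before averaging.
int_buckets = [(int(key), count) for key, count in float_buckets]

avg_int = weighted_average(int_buckets)
avg_float = weighted_average(float_buckets)
print(avg_int)              # average of the truncated ratings
print(round(avg_float, 2))  # average of the decimal ratings
```

Under that assumption the decimal ratings average to roughly 8.19, so an avg aggregation on a float field would give a more faithful figure than one on the integer field.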

Display Ratings by 1.0 Intervals

Earlier I ran into the issue that I had set the type of my ratings to float. Instead of re-indexing the data with integer ratings, I could also have used the histogram aggregation that Elasticsearch provides:

curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "aggs": {
    "whole-ratings": {
      "histogram": {
        "field": "imdb-rating",
        "interval": 1.0
      }
    }
  }
}'

Even with the ratings set to float, I now get my data nicely formatted and ready to be used for a histogram visualisation:

"aggregations" : {
  "whole-ratings" : {
    "buckets" : [
      {
        "key" : 7.0,
        "doc_count" : 4
      },
      {
        "key" : 8.0,
        "doc_count" : 6
      }
    ]
  }
}
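
The histogram aggregation assigns each value to the bucket whose key is floor((value - offset) / interval) * interval + offset. The Python sketch below applies that formula to the season 01 ratings from the float terms aggregation (again read as one-decimal values, which is an assumption). Unlike Elasticsearch, this sketch does not fill gaps with empty buckets.

```python
import math
from collections import Counter

def histogram(values, interval, offset=0.0):
    """Count values per bucket, keyed by the lower bound of each bucket."""
    keys = (math.floor((v - offset) / interval) * interval + offset for v in values)
    counts = Counter(keys)
    # Buckets are returned in ascending key order; empty buckets are not emitted.
    return [{"key": k, "doc_count": counts[k]} for k in sorted(counts)]

# Season 01 ratings: four 8.7s, two 7.7s, two 8.0s, one 7.8 and one 7.9.
ratings = [8.7] * 4 + [7.7] * 2 + [8.0] * 2 + [7.8, 7.9]
buckets = histogram(ratings, interval=1.0)
print(buckets)
```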