Elasticsearch 7 Aggregations
An aggregation summarizes your data as metrics, statistics, or other analytics. Aggregations help you answer questions like:
- What’s the average load time for my website?
- Who are my most valuable customers based on transaction volume?
- What would be considered a large file on my network?
- How many products are in each product category?
Elasticsearch organizes aggregations into three categories:
- Metric aggregations that calculate metrics, such as a sum or average, from field values.
- Bucket aggregations that group documents into buckets, also called bins, based on field values, ranges, or other criteria.
- Pipeline aggregations that take input from other aggregations instead of documents or fields.
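To make the three categories concrete, here is a minimal sketch of one request body that uses one of each. It is written as a Python dictionary so it can be serialised with json.dumps. The field names category and price and all aggregation names are hypothetical and do not come from any index used in this article.

```python
import json


def build_example_aggs():
    """Return a search body with a bucket, a metric and a pipeline aggregation."""
    return {
        "size": 0,  # only aggregation results are wanted, no hits
        "aggs": {
            # Bucket aggregation: one bucket per distinct value of "category".
            "per-category": {
                "terms": {"field": "category"},
                "aggs": {
                    # Metric aggregation: average "price" inside each bucket.
                    "avg-price": {"avg": {"field": "price"}}
                },
            },
            # Pipeline aggregation: reads the per-bucket averages, not documents.
            "avg-of-category-averages": {
                "avg_bucket": {"buckets_path": "per-category>avg-price"}
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_example_aggs(), indent=2))
```

The printed JSON could be sent as the body of a _search request. Nesting avg-price inside per-category is what makes the metric run once per bucket.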
Run an Aggregation
You can run aggregations as part of a search by specifying the search API's aggs parameter. The following search runs a terms aggregation on the rating field, e.g. a product rating:
GET /my-index/_search
{
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "rating"
      }
    }
  }
}
Aggregation results are in the response’s aggregations object:
"aggregations": {
  "my-agg-name": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": []
  }
}
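Once the response has been parsed into a dictionary, the buckets can be read with a few lines of code. The following is a small sketch; the sample response and its bucket values are invented for illustration, and bucket_counts is a hypothetical helper name.

```python
def bucket_counts(response, agg_name):
    """Map each bucket key of a terms aggregation to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hypothetical response; the bucket values are invented for illustration.
sample_response = {
    "aggregations": {
        "my-agg-name": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": 5, "doc_count": 3},
                {"key": 4, "doc_count": 1},
            ],
        }
    }
}

if __name__ == "__main__":
    print(bucket_counts(sample_response, "my-agg-name"))
```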
You can also restrict the documents that go into an aggregation by adding a query. For example, to get the average rating for a single product:
curl -H "Content-Type: application/json" -XGET '127.0.0.1:9200/ratings/_search?size=0&pretty' -d '
{
  "query": {
    "match": {
      "title": "My Product Title"
    }
  },
  "aggs": {
    "my-agg-name": {
      "avg": {
        "field": "rating"
      }
    }
  }
}'
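The same filtered search can also be prepared from Python. The sketch below uses only the standard library and builds the request with an avg aggregation (which returns a single average value), but it does not send it. The helper name build_search_request is hypothetical, and the host, index and title are simply the placeholders from the curl command.

```python
import json
import urllib.request


def build_search_request(base_url, index, body):
    """Build (but do not send) a search request with a JSON body."""
    url = f"{base_url}/{index}/_search?size=0&pretty"
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


body = {
    "query": {"match": {"title": "My Product Title"}},
    "aggs": {"my-agg-name": {"avg": {"field": "rating"}}},
}
request = build_search_request("http://127.0.0.1:9200", "ratings", body)

if __name__ == "__main__":
    # Sending requires a running Elasticsearch node, so it is left commented out.
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
    print(request.get_method(), request.full_url)
```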
TV Ratings Example
As an example I am going to use my The Expanse TV series index, which has an imdb-rating field that I can run aggregation queries against. In that index we joined a couple of episodes that all belong to Season 01 of the show, and we were able to query all of them with:
GET /the-expanse/_search
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  }
}
Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [imdb-rating] in order to load field data by uninverting the inverted index. Note that this can use significant memory.
I received the error message above at first because I had not added an explicit mapping to this index, which meant that the imdb-rating field was mapped as text. I fixed this in the original article linked above and was then able to run the aggregation:
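For reference, an explicit mapping of the kind described here might look like the sketch below. It only covers the rating field; the real index also needs its join field for the season/episode relation, which is not shown in this article and is therefore left out. The helper name build_rating_mapping is hypothetical. Since the type of an existing field cannot be changed in place, a body like this would be sent when creating a new index, followed by re-indexing the documents.

```python
import json


def build_rating_mapping(field="imdb-rating", field_type="float"):
    """Return an index-creation body that maps the rating field numerically."""
    return {"mappings": {"properties": {field: {"type": field_type}}}}


if __name__ == "__main__":
    # Would be sent as the body of: PUT /<new-index-name>
    print(json.dumps(build_rating_mapping(), indent=2))
```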
What is the Rating Distribution for Season 01?
curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  },
  "aggs": {
    "imdb-ratings": {
      "terms": {
        "field": "imdb-rating"
      }
    }
  }
}'
I set the imdb-rating type to float, which gave me these results:
"aggregations" : {
  "imdb-ratings" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 8.699999809265137,
        "doc_count" : 4
      },
      {
        "key" : 7.699999809265137,
        "doc_count" : 2
      },
      {
        "key" : 8.0,
        "doc_count" : 2
      },
      {
        "key" : 7.800000190734863,
        "doc_count" : 1
      },
      {
        "key" : 7.900000095367432,
        "doc_count" : 1
      }
    ]
  }
}
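The odd-looking keys such as 8.699999809265137 are what a rating like 8.7 becomes when it is stored as a 32-bit float and then printed with 64-bit precision. The following sketch reproduces the effect with Python's struct module; to_float32 is a hypothetical helper name.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float and widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


if __name__ == "__main__":
    for rating in (8.7, 7.7, 8.0, 7.8, 7.9):
        print(rating, "->", to_float32(rating))
```

A value like 8.0 survives unchanged because it is exactly representable in binary, while 8.7 is not.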
Whole-star buckets are easier to read, and one way to get them is to map imdb-rating as "type": "integer" (as an alternative, see the histogram section below). The integer mapping drops the fractional part of each rating, which leaves 6 episodes with an 8-star rating and 4 episodes with a 7-star rating for Season 01 of The Expanse:
"aggregations" : {
  "average-rating" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 8,
        "doc_count" : 6
      },
      {
        "key" : 7,
        "doc_count" : 4
      }
    ]
  }
}
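The counts of 6 and 4 are what you get when the fractional part of each rating is dropped rather than rounded to the nearest whole number. A short sketch, assuming the default coercion of an integer field (which truncates decimals) and using the float buckets shown earlier as input; truncate_buckets is a hypothetical helper name:

```python
from collections import Counter

# (rating, number of episodes) pairs read off the float buckets in this article.
FLOAT_BUCKETS = [(8.7, 4), (7.7, 2), (8.0, 2), (7.8, 1), (7.9, 1)]


def truncate_buckets(buckets):
    """Collapse float buckets into integer buckets by dropping the fraction."""
    counts = Counter()
    for rating, doc_count in buckets:
        counts[int(rating)] += doc_count  # int() truncates toward zero
    return dict(counts)


if __name__ == "__main__":
    print(truncate_buckets(FLOAT_BUCKETS))
```

Rounding to nearest would instead have put the 7.7, 7.8 and 7.9 episodes into the 8-star bucket.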
How many 8 Star Ratings did Season 01 get?
This time we just need to filter for imdb-rating values with an integer value of 8:
curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "match": {
      "imdb-rating": 8
    }
  },
  "aggs": {
    "imdb-ratings": {
      "terms": {
        "field": "imdb-rating"
      }
    }
  }
}'
And we get the expected result: 6 episodes of Season 01 of The Expanse have an IMDB rating that truncates to 8:
"aggregations" : {
  "imdb-ratings" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 8,
        "doc_count" : 6
      }
    ]
  }
}
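Note that the query above matches on imdb-rating across the whole index and does not repeat the has_parent clause. If the index held more than one season, the two conditions could be combined in a bool query. The sketch below shows one possible request body as a Python dictionary; season_rating_query is a hypothetical helper name, and using filter clauses with a term query is my assumption about a reasonable way to write it, not the query used in this article.

```python
def season_rating_query(season_title, rating):
    """Match episodes of one season (via has_parent) with a given rating."""
    return {
        "query": {
            "bool": {
                # "filter" clauses must all match and do not affect scoring.
                "filter": [
                    {
                        "has_parent": {
                            "parent_type": "season",
                            "query": {"match": {"title": season_title}},
                        }
                    },
                    {"term": {"imdb-rating": rating}},
                ]
            }
        },
        "aggs": {"imdb-ratings": {"terms": {"field": "imdb-rating"}}},
    }


if __name__ == "__main__":
    import json

    print(json.dumps(season_rating_query("Season 01", 8), indent=2))
```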
What is the Average Rating of all Episodes in Season 01?
curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  },
  "aggs": {
    "average-ratings": {
      "avg": {
        "field": "imdb-rating"
      }
    }
  }
}'
The average rating for Season 01 of The Expanse is 7.6, calculated from the integer-mapped ratings (6 × 8 plus 4 × 7, divided by 10 episodes):
"aggregations" : {
  "average-ratings" : {
    "value" : 7.6
  }
}
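The value 7.6 can be checked by hand from the integer buckets shown earlier. The sketch below does the arithmetic and, for comparison, repeats it with the float ratings, which would give an average of about 8.19 instead. weighted_average is a hypothetical helper name; the bucket lists are copied from the results in this article.

```python
def weighted_average(buckets):
    """Average of bucket keys, each weighted by its doc_count."""
    total = sum(key * count for key, count in buckets)
    docs = sum(count for _, count in buckets)
    return total / docs


INTEGER_BUCKETS = [(8, 6), (7, 4)]
FLOAT_BUCKETS = [(8.7, 4), (7.7, 2), (8.0, 2), (7.8, 1), (7.9, 1)]

if __name__ == "__main__":
    print(weighted_average(INTEGER_BUCKETS))
    print(round(weighted_average(FLOAT_BUCKETS), 2))
```

The gap between the two numbers is the price of truncating the ratings at index time.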
Display Ratings by 1.0 Intervals
Earlier I ran into the issue that I had set the type of my ratings to float. Instead of re-indexing the data with integer ratings, I could also have used the histogram aggregation that Elasticsearch provides:
curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "aggs": {
    "whole-ratings": {
      "histogram": {
        "field": "imdb-rating",
        "interval": 1.0
      }
    }
  }
}'
Even with the ratings set to float, I now get my data nicely bucketed and ready to be used for a histogram visualisation:
"aggregations" : {
  "whole-ratings" : {
    "buckets" : [
      {
        "key" : 7.0,
        "doc_count" : 4
      },
      {
        "key" : 8.0,
        "doc_count" : 6
      }
    ]
  }
}
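The histogram aggregation assigns each value to a bucket by rounding it down to the nearest multiple of the interval. The following sketch mirrors that rule locally so the result above can be reproduced without a cluster. histogram_key and histogram are hypothetical helper names, and the ratings list is expanded from the float buckets shown earlier.

```python
import math
from collections import Counter


def histogram_key(value, interval, offset=0.0):
    """Bucket key for a value: floor((value - offset) / interval) * interval + offset."""
    return math.floor((value - offset) / interval) * interval + offset


def histogram(values, interval):
    """Count values per bucket key, sorted by key like the response above."""
    counts = Counter(histogram_key(v, interval) for v in values)
    return dict(sorted(counts.items()))


# Episode ratings expanded from the float buckets shown earlier in this article.
RATINGS = [8.7] * 4 + [7.7] * 2 + [8.0] * 2 + [7.8, 7.9]

if __name__ == "__main__":
    print(histogram(RATINGS, 1.0))
```

Unlike the integer mapping, this keeps the original float values in the index, so a finer interval can be chosen later without re-indexing.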