Elasticsearch 7 Aggregations


Run an Aggregation
- TV Ratings Example

An aggregation summarizes your data as metrics, statistics, or other analytics. Aggregations help you answer questions like:

What’s the average load time for my website?
Who are my most valuable customers based on transaction volume?
What would be considered a large file on my network?
How many products are in each product category?

Elasticsearch organizes aggregations into three categories:

Metric aggregations that calculate metrics, such as a sum or average, from field values.
Bucket aggregations that group documents into buckets, also called bins, based on field values, ranges, or other criteria.
Pipeline aggregations that take input from other aggregations instead of documents or fields.

Run an Aggregation

You can run aggregations as part of a search by specifying the search API's aggs parameter. The following search runs a terms aggregation on the rating field (for example, a product rating):

GET /my-index/_search
{
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "rating"
      }
    }
  }
}

Aggregation results are in the response’s aggregations object:

"aggregations": {
    "my-agg-name": {                           
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
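To work with such a response in code, the buckets can be flattened into a simple mapping. The following is a minimal sketch in Python; the helper name `bucket_counts` and the two buckets in `sample_response` are made up for illustration and are not part of the original example:

```python
from typing import Any, Dict


def bucket_counts(response: Dict[str, Any], agg_name: str) -> Dict[Any, int]:
    """Map each bucket key of a terms aggregation to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hypothetical response body, shaped like the one above but with two
# invented buckets so that there is something to read out.
sample_response = {
    "aggregations": {
        "my-agg-name": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": 5, "doc_count": 3},
                {"key": 4, "doc_count": 1},
            ],
        }
    }
}

print(bucket_counts(sample_response, "my-agg-name"))
```

With an empty buckets list, as in the response above, the helper simply returns an empty dictionary.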

You can also narrow down the documents with a query before aggregating. For example, to get the average rating of a single product:

curl -H "Content-Type: application/json" -XGET '127.0.0.1:9200/ratings/_search?size=0&pretty' -d'
{
  "query": {
    "match": {
      "title": "My Product Title"
    }
  },
  "aggs": {
    "my-agg-name": {
      "avg": {
        "field": "rating"
      }
    }
  }
}'
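The same kind of request can be prepared from a script. The sketch below only builds the HTTP request with Python's standard library and does not send it; the function name `build_search_request` is hypothetical, the host and index are taken from the curl call above, and the aggregation body is an illustrative assumption. It uses POST, which the _search endpoint accepts as well as GET:

```python
import json
import urllib.request


def build_search_request(base_url, index, body, size=0):
    """Build (but do not send) a _search request carrying a JSON body."""
    url = f"{base_url}/{index}/_search?size={size}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative body: filter by title first, then aggregate the matching documents.
body = {
    "query": {"match": {"title": "My Product Title"}},
    "aggs": {"my-agg-name": {"avg": {"field": "rating"}}},
}

request = build_search_request("http://127.0.0.1:9200", "ratings", body)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it to a running cluster. `size=0` suppresses the search hits so that only the aggregation results come back.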

TV Ratings Example

As an example I am going to use my The Expanse TV series index, which has an imdb-rating field that I can run aggregation queries against. In that index we joined a couple of episodes that all belong to Season 01 of the show, and we were able to query all of them with:

GET /the-expanse/_search
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  }
}


Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [imdb-rating] in order to load field data by uninverting the inverted index. Note that this can use significant memory.

I received the error message above at first because I had not yet added an explicit mapping to this index, which meant that the imdb-rating field was dynamically mapped as text. I fixed this in the original article and was then able to run the aggregation:

What is the Rating Distribution for Season 01?

curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  },
  "aggs": {
    "imdb-ratings": {
      "terms": {
        "field": "imdb-rating"
      }
    }
  }
}'

I set the type of imdb-rating to float, which gave me these results:

"aggregations" : {
    "imdb-ratings" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 8.699999809265137,
          "doc_count" : 4
        },
        {
          "key" : 7.699999809265137,
          "doc_count" : 2
        },
        {
          "key" : 8.0,
          "doc_count" : 2
        },
        {
          "key" : 7.800000190734863,
          "doc_count" : 1
        },
        {
          "key" : 7.900000095367432,
          "doc_count" : 1
        }
      ]
    }
  }
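The long decimal keys are most likely not a data error: a float field stores 32-bit values, and a rating such as 8.7 has no exact 32-bit representation. A minimal sketch, assuming the original ratings were 8.7, 7.7, 8.0, 7.8 and 7.9 (the helper name `as_float32` is made up for illustration):

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a Python float through 32-bit storage and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Assumed original ratings, read off the bucket keys above.
for rating in (8.7, 7.7, 8.0, 7.8, 7.9):
    print(rating, "->", as_float32(rating))
```

The printed values should line up with the bucket keys returned by the terms aggregation, which suggests that the keys are simply the stored 32-bit values shown with full double precision.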

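If you only care about whole stars, you can map imdb-rating as "type": "integer" instead (as an alternative, see the histogram section below). Elasticsearch then truncates the fractional part of each value, which leaves 6 episodes with an 8 star rating and 4 episodes with a 7 star rating for Season 01 of The Expanse: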

"aggregations" : {
    "average-rating" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 8,
          "doc_count" : 6
        },
        {
          "key" : 7,
          "doc_count" : 4
        }
      ]
    }
  }
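The step from the float buckets to the integer buckets can be checked by hand. The sketch below assumes that the fractional part is truncated (not rounded) when a decimal value ends up in an integer field, and it takes the per-rating episode counts from the float result further up:

```python
from collections import Counter

# Season 01 ratings as implied by the float buckets: rating -> number of episodes.
float_buckets = {8.7: 4, 7.7: 2, 8.0: 2, 7.8: 1, 7.9: 1}

integer_buckets = Counter()
for rating, episodes in float_buckets.items():
    # int() drops the fractional part, so 8.7 becomes 8 and 7.9 becomes 7.
    integer_buckets[int(rating)] += episodes

print(dict(integer_buckets))
```

Under that assumption the counts come out as 6 episodes with 8 and 4 episodes with 7, the same as in the response above. With rounding instead of truncation, 7.7, 7.8 and 7.9 would all have ended up in the 8 bucket.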

How many 8 Star Ratings did Season 01 get?

This time we just need to filter for imdb-rating values equal to the integer 8:

curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "match": {
      "imdb-rating": 8
    }
  },
  "aggs": {
    "imdb-ratings": {
      "terms": {
        "field": "imdb-rating"
      }
    }
  }
}'

And we get the expected result: 6 episodes of Season 01 of The Expanse have an IMDb rating that is stored as the integer 8:

"aggregations" : {
    "imdb-ratings" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 8,
          "doc_count" : 6
        }
      ]
    }
  }

What is the Average Rating of all Episodes in Season 01?

curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "query": {
    "has_parent": {
      "parent_type": "season",
      "query": {
        "match": {
          "title": "Season 01"
        }
      }
    }
  },
  "aggs": {
    "average-ratings": {
      "avg": {
        "field": "imdb-rating"
      }
    }
  }
}'

The average rating for season 01 of The Expanse is 7.6:

"aggregations" : {
    "average-ratings" : {
      "value" : 7.6
    }
  }
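A value of 7.6 is what you get when averaging the integer-mapped ratings (6 episodes with 8, 4 episodes with 7). As a rough cross-check, and assuming the original decimal ratings were the ones implied by the float buckets earlier, the two averages can be compared in a few lines of Python:

```python
def average(values):
    """Arithmetic mean, which is what the avg aggregation computes."""
    return sum(values) / len(values)


# Ratings as stored in an integer field (6 x 8 and 4 x 7).
integer_ratings = [8] * 6 + [7] * 4

# Assumed original decimal ratings, taken from the float bucket counts.
float_ratings = [8.7] * 4 + [7.7] * 2 + [8.0] * 2 + [7.8, 7.9]

print(average(integer_ratings))          # 7.6
print(round(average(float_ratings), 2))  # 8.19
```

The gap between the two numbers shows how much precision is lost when ratings are indexed as integers, so a float mapping is probably the better choice whenever the average matters.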

Display Ratings by 1.0 Intervals

Earlier I ran into the problem that my ratings were mapped as float. Instead of re-indexing the data with integer ratings, I could also have used the histogram aggregation that Elasticsearch provides:

curl -H "Content-Type: application/json" -XGET 'localhost:9200/the-expanse/_search?size=0&pretty' -d'
{
  "aggs": {
    "whole-ratings": {
      "histogram": {
        "field": "imdb-rating",
        "interval": 1.0
      }
    }
  }
}'

Even with the ratings set to float I now get my data nicely formatted and ready to be used for a histogram visualisation:

"aggregations" : {
    "whole-ratings" : {
      "buckets" : [
        {
          "key" : 7.0,
          "doc_count" : 4
        },
        {
          "key" : 8.0,
          "doc_count" : 6
        }
      ]
    }
  }
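The bucket keys follow from rounding each value down to the nearest multiple of the interval. A minimal local sketch of that rule in Python, using the Season 01 ratings implied by the earlier float buckets as assumed input (the function name `histogram` is only illustrative):

```python
import math
from collections import Counter


def histogram(values, interval, offset=0.0):
    """Count values per bucket, keyed by the lower bound of each bucket."""
    counts = Counter()
    for value in values:
        key = math.floor((value - offset) / interval) * interval + offset
        counts[key] += 1
    return dict(sorted(counts.items()))


# Assumed decimal ratings for the ten Season 01 episodes.
ratings = [8.7] * 4 + [7.7] * 2 + [8.0] * 2 + [7.8, 7.9]

print(histogram(ratings, 1.0))
```

This gives 4 episodes in the 7.0 bucket and 6 in the 8.0 bucket, the same split as the response above. A smaller interval such as 0.5 would produce a finer distribution without touching the mapping.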
