[[search-aggregations-metrics-percentile-rank-aggregation]]
=== Percentile Ranks Aggregation
A `multi-value` metrics aggregation that calculates one or more percentile ranks
over numeric values extracted from the aggregated documents. These values can be
generated by a provided script or extracted from specific numeric or
<<histogram,histogram fields>> in the documents.

[NOTE]
==================================================
Please see <<search-aggregations-metrics-percentile-aggregation-approximation>>
and <<search-aggregations-metrics-percentile-aggregation-compression>> for advice
regarding approximation and memory use of the percentile ranks aggregation.
==================================================

Percentile ranks show the percentage of observed values that are at or below a certain
value. For example, if a value is greater than or equal to 95% of the observed values,
it is said to be at the 95th percentile rank.
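
As a rough illustration of this definition, the following Python sketch computes an exact percentile rank over a small in-memory list. It is not how the aggregation is implemented (the aggregation uses approximation algorithms); the function name `percentile_rank` and the sample load times are assumptions made up for this example.

```python
def percentile_rank(observations, value):
    """Return the percentage of observations that are less than or equal to value."""
    if not observations:
        raise ValueError("observations must not be empty")
    at_or_below = sum(1 for obs in observations if obs <= value)
    return 100.0 * at_or_below / len(observations)


# Hypothetical load times in milliseconds, invented for illustration.
load_times = [120, 250, 310, 480, 495, 510, 530, 590, 605, 900]

print(percentile_rank(load_times, 500))  # 50.0
print(percentile_rank(load_times, 600))  # 80.0
```

Here 5 of the 10 sample values are at or below 500, so 500 sits at the 50th percentile rank of this sample.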

Assume your data consists of website load times. You may have a service agreement that
95% of page loads complete within 500ms and 99% of page loads complete within 600ms.

Let's look at the percentile ranks for these two load times:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs" : {
    "load_time_ranks" : {
      "percentile_ranks" : {
        "field" : "load_time", <1>
        "values" : [500, 600]
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> The field `load_time` must be a numeric field

The response will look like this:

[source,console-result]
--------------------------------------------------
{
  ...

  "aggregations": {
    "load_time_ranks": {
      "values" : {
        "500.0": 90.01,
        "600.0": 100.0
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/"500.0": 90.01/"500.0": 55.00000000000001/]
// TESTRESPONSE[s/"600.0": 100.0/"600.0": 64.0/]

From this information you can determine you are hitting the 99% load time target but not quite
hitting the 95% load time target.
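
The comparison against the service agreement can be sketched in a few lines of Python. This is only an illustration, assuming the `values` object of the response has already been parsed into a dictionary; the helper name `check_targets` and the `targets` mapping are hypothetical.

```python
def check_targets(ranks, targets):
    """Map each threshold to True if its observed rank meets the required percentage."""
    return {
        threshold: ranks[threshold] >= required
        for threshold, required in targets.items()
    }


# The "values" object from the example response.
ranks = {"500.0": 90.01, "600.0": 100.0}

# Assumed encoding of the service agreement: 95% within 500ms, 99% within 600ms.
targets = {"500.0": 95.0, "600.0": 99.0}

print(check_targets(ranks, targets))  # {'500.0': False, '600.0': True}
```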

==== Keyed Response
By default the `keyed` flag is set to `true`, which associates a unique string key with each percentile rank and returns them as a hash rather than an array. Setting the `keyed` flag to `false` disables this behavior:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [500, 600],
        "keyed": false
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

Response:

[source,console-result]
--------------------------------------------------
{
  ...

  "aggregations": {
    "load_time_ranks": {
      "values": [
        {
          "key": 500.0,
          "value": 90.01
        },
        {
          "key": 600.0,
          "value": 100.0
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
// TESTRESPONSE[s/"value": 90.01/"value": 55.00000000000001/]
// TESTRESPONSE[s/"value": 100.0/"value": 64.0/]
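
To make the relationship between the two response shapes concrete, here is a small client-side Python sketch that turns the array form into the hash form. The helper name `keyed_from_unkeyed` is hypothetical, and using Python's `str()` to build the keys is an assumption that happens to reproduce the keys shown above; it is not guaranteed to match the server's key formatting for every value.

```python
def keyed_from_unkeyed(values):
    """Convert a list of {"key": ..., "value": ...} entries into a key -> value dict."""
    return {str(entry["key"]): entry["value"] for entry in values}


# The "values" array from the unkeyed example response.
unkeyed = [
    {"key": 500.0, "value": 90.01},
    {"key": 600.0, "value": 100.0},
]

print(keyed_from_unkeyed(unkeyed))  # {'500.0': 90.01, '600.0': 100.0}
```

The array form preserves the order of the requested values, which can be convenient when the results are rendered in sequence.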

==== Script
The percentile rank metric supports scripting. For example, if our load times
are in milliseconds but we want to specify values in seconds, we could use
a script to convert them on the fly:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs" : {
    "load_time_ranks" : {
      "percentile_ranks" : {
        "values" : [500, 600],
        "script" : {
          "lang": "painless",
          "source": "doc['load_time'].value / params.timeUnit", <1>
          "params" : {
            "timeUnit" : 1000 <2>
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> The `field` parameter is replaced with a `script` parameter, which uses the
script to generate the values on which percentile ranks are calculated
<2> Scripting supports parameterized input just like any other script

The example above specifies an `inline` script in the `painless` script language with one script parameter. To use a stored script, use the following syntax:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs" : {
    "load_time_ranks" : {
      "percentile_ranks" : {
        "values" : [500, 600],
        "script" : {
          "id": "my_script",
          "params": {
            "field": "load_time"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency,stored_example_script]

==== HDR Histogram

NOTE: This setting exposes the internal implementation of HDR Histogram and the syntax may change in the future.

https://github.com/HdrHistogram/HdrHistogram[HDR Histogram] (High Dynamic Range Histogram) is an alternative implementation
that can be useful when calculating percentile ranks for latency measurements as it can be faster than the t-digest implementation
with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a
number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000
microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to
1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).

The HDR Histogram can be used by specifying the `hdr` parameter in the request:

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs" : {
    "load_time_ranks" : {
      "percentile_ranks" : {
        "field" : "load_time",
        "values" : [500, 600],
        "hdr": { <1>
          "number_of_significant_value_digits" : 3 <2>
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]

<1> `hdr` object indicates that HDR Histogram should be used to calculate the percentile ranks and specific settings for this algorithm can be specified inside the object
<2> `number_of_significant_value_digits` specifies the resolution of values for the histogram in number of significant digits

The HDR Histogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use
the HDR Histogram if the range of values is unknown as this could lead to high memory usage.
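
The resolution figures quoted in this section follow from simple arithmetic: with 3 significant digits, the worst-case error is roughly one part in 1,000 of the recorded value. The Python sketch below reproduces those two numbers. It is a back-of-the-envelope calculation, not the HDR Histogram library's internal bucketing, and the function name `worst_case_resolution` is made up for this example.

```python
def worst_case_resolution(value, significant_digits):
    """Approximate worst-case absolute resolution at `value`: one part in 10**digits."""
    return value / 10 ** significant_digits


ONE_MILLISECOND_US = 1_000      # 1 millisecond expressed in microseconds
ONE_HOUR_US = 3_600_000_000     # 1 hour expressed in microseconds

# Resolution at 1 millisecond, in microseconds.
print(worst_case_resolution(ONE_MILLISECOND_US, 3))  # 1.0

# Resolution at 1 hour, converted from microseconds to seconds.
print(worst_case_resolution(ONE_HOUR_US, 3) / 1_000_000)  # 3.6
```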

==== Missing value
The `missing` parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.

[source,console]
--------------------------------------------------
GET latency/_search
{
  "size": 0,
  "aggs" : {
    "load_time_ranks" : {
      "percentile_ranks" : {
        "field" : "load_time",
        "values" : [500, 600],
        "missing": 10 <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:latency]
<1> Documents without a value in the `load_time` field will fall into the same bucket as documents that have the value `10`.
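
The effect of `missing` on the result can be illustrated with an exact, in-memory calculation in Python. This is a sketch of the idea only, not of the aggregation's approximate implementation; the function name `percentile_rank_with_missing` and the sample data (where `None` stands for a document without a `load_time` value) are assumptions for this example.

```python
def percentile_rank_with_missing(values, threshold, missing=None):
    """Percentage of values <= threshold.

    None entries are ignored unless `missing` is given, in which case
    they are replaced by `missing` before the rank is computed.
    """
    if missing is None:
        usable = [v for v in values if v is not None]
    else:
        usable = [missing if v is None else v for v in values]
    if not usable:
        raise ValueError("no values to rank")
    return 100.0 * sum(1 for v in usable if v <= threshold) / len(usable)


# Hypothetical load times; None marks a document with no value.
load_times = [200, None, 550, None, 700]

# Default behavior: the two documents without a value are ignored (about 33.3).
print(percentile_rank_with_missing(load_times, 500))

# With missing=10, both are counted as 10, which is below the threshold.
print(percentile_rank_with_missing(load_times, 500, missing=10))  # 60.0
```

Because the substituted value is counted like any other observation, a low `missing` value raises the percentile rank of every threshold above it.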