[[search-aggregations-metrics-percentile-aggregation]]
=== Percentiles

added[1.1.0]

A `multi-value` metrics aggregation that calculates one or more percentiles
over numeric values extracted from the aggregated documents. These values
can be extracted either from specific numeric fields in the documents, or
be generated by a provided script.

.Experimental!
[IMPORTANT]
=====
This feature is marked as experimental, and may be subject to change in the
future. If you use this feature, please let us know your experience with it!
=====

Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95%
of the observed values.

Percentiles are often used to find outliers. In a normal distribution, the
0.13th and 99.87th percentiles represent three standard deviations from the
mean. Any data which falls outside three standard deviations is often considered
an anomaly.
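
As a quick sanity check on those numbers (a plain Python sketch, not part of the Elasticsearch API, assuming a standard normal distribution):

```python
from statistics import NormalDist

# Fraction of a standard normal distribution lying below -3 and +3
# standard deviations from the mean, expressed as percentiles.
nd = NormalDist()  # mean 0, standard deviation 1
print(f"{nd.cdf(-3) * 100:.2f}th percentile at -3 sigma")  # ~0.13th
print(f"{nd.cdf(+3) * 100:.2f}th percentile at +3 sigma")  # ~99.87th
```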

When a range of percentiles is retrieved, it can be used to estimate the
data distribution and determine whether the data is skewed, bimodal, etc.

Assume your data consists of website load times. The average and median
load times are not overly useful to an administrator. The max may be interesting,
but it can be easily skewed by a single slow response.

Let's look at a range of percentiles representing load time:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time" <1>
            }
        }
    }
}
--------------------------------------------------
<1> The field `load_time` must be a numeric field

By default, the `percentile` metric will generate a range of
percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:

[source,js]
--------------------------------------------------
{
    ...

    "aggregations": {
        "load_time_outlier": {
            "1.0": 15,
            "5.0": 20,
            "25.0": 23,
            "50.0": 25,
            "75.0": 29,
            "95.0": 60,
            "99.0": 150
        }
    }
}
--------------------------------------------------

As you can see, the aggregation will return a calculated value for each percentile
in the default range. If we assume response times are in milliseconds, it is
immediately obvious that the webpage normally loads in 15-30ms, but occasionally
spikes to 60-150ms.

Often, administrators are only interested in outliers -- the extreme percentiles.
We can specify just the percents we are interested in (requested percentiles
must be a value between 0-100 inclusive):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9] <1>
            }
        }
    }
}
--------------------------------------------------
<1> Use the `percents` parameter to specify particular percentiles to calculate

==== Script

The percentile metric supports scripting. For example, if our load times
are in milliseconds but we want percentiles calculated in seconds, we could use
a script to convert them on-the-fly:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "script" : "doc['load_time'].value / timeUnit", <1>
                "params" : {
                    "timeUnit" : 1000 <2>
                }
            }
        }
    }
}
--------------------------------------------------
<1> The `field` parameter is replaced with a `script` parameter, which uses the
script to generate the values on which the percentiles are calculated
<2> Scripting supports parameterized input just like any other script

==== Percentiles are (usually) approximate

There are many different algorithms to calculate percentiles. The naive
implementation simply stores all the values in a sorted array. To find the 50th
percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
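
As a rough illustration, the naive approach described above can be sketched in a few lines of Python (purely illustrative; this is not Elasticsearch code):

```python
# Naive exact percentile: sort everything, then index into the array.
def naive_percentile(values, q):
    """Return the q-th percentile (0-100) of `values` via a full sort."""
    my_array = sorted(values)
    # Clamp the index so q=100 stays within bounds.
    index = min(len(my_array) - 1, int(len(my_array) * q / 100.0))
    return my_array[index]

load_times = [15, 20, 23, 25, 29, 60, 150]
print(naive_percentile(load_times, 50))  # -> 25
```

Note that this keeps a full sorted copy of the data in memory, which is exactly the scaling problem discussed here.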

Clearly, the naive implementation does not scale -- the sorted array grows
linearly with the number of values in your dataset. To calculate percentiles
across potentially billions of values in an Elasticsearch cluster, _approximate_
percentiles are calculated.

The algorithm used by the `percentile` metric is called TDigest (introduced by
Ted Dunning in
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]).

When using this metric, there are a few guidelines to keep in mind:

- Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%)
are more accurate than less extreme percentiles, such as the median.
- For small sets of values, percentiles are highly accurate (and potentially
100% accurate if the data is small enough).
- As the quantity of values in a bucket grows, the algorithm begins to approximate
the percentiles. It is effectively trading accuracy for memory savings. The
exact level of inaccuracy is difficult to generalize, since it depends on your
data distribution and the volume of data being aggregated.
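
To make the first guideline concrete, here is the `q(1-q)` term evaluated at a few percentiles, with `q` written as a fraction (simple arithmetic, not Elasticsearch code):

```python
# q(1-q) shrinks toward the tails, so the accuracy bound tightens for
# extreme percentiles such as the 99th compared to the median.
for q in (0.50, 0.95, 0.99):
    print(f"q={q:.2f}  q(1-q)={q * (1 - q):.4f}")
```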

The following chart shows the relative error on a uniform distribution depending
on the number of collected values and the requested percentile:

image:images/percentiles_error.png[]

It shows how precision is better for extreme percentiles. The reason why the error diminishes
for large numbers of values is that the law of large numbers makes the distribution of
values more and more uniform, so the t-digest tree can do a better job of summarizing
it. This would not be the case on more skewed distributions.

==== Compression

Approximate algorithms must balance memory utilization with estimation accuracy.
This balance can be controlled using a `compression` parameter:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "compression" : 200 <1>
            }
        }
    }
}
--------------------------------------------------
<1> Compression controls memory usage and approximation error

The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
more nodes available, the higher the accuracy (and the larger the memory footprint),
proportional to the volume of data. The `compression` parameter limits the maximum
number of nodes to `100 * compression`.

Therefore, by increasing the compression value, you can increase the accuracy of
your percentiles at the cost of more memory. Larger compression values also
make the algorithm slower since the underlying tree data structure grows in size,
resulting in more expensive operations. The default compression value is
`100`.

A "node" uses roughly 48 bytes of memory, so under worst-case scenarios (a large amount
of data which arrives sorted and in-order) the default settings will produce a
TDigest roughly 480KB in size. In practice data tends to be more random and
the TDigest will use less memory.
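
Putting those numbers together as a back-of-the-envelope check (plain Python, not an API call):

```python
# Worst-case TDigest size with the default settings described above:
# up to 100 * compression nodes at roughly 48 bytes per node.
compression = 100      # the default compression value
bytes_per_node = 48    # approximate per-node footprint
max_nodes = 100 * compression
worst_case_bytes = max_nodes * bytes_per_node
print(max_nodes, worst_case_bytes)  # 10000 nodes, 480000 bytes (~480KB)
```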