[role="xpack"]
[testenv="basic"]
[[search-aggregations-metrics-string-stats-aggregation]]
=== String stats aggregation
++++
<titleabbrev>String stats</titleabbrev>
++++

A `multi-value` metrics aggregation that computes statistics over string values extracted from the aggregated documents.
These values can be retrieved from specific `keyword` fields in the documents or generated by a provided script.

WARNING: Using scripts can result in slower search speeds. See
<<scripts-and-search-speed>>.

The string stats aggregation returns the following results:

* `count` - The number of non-empty fields counted.
* `min_length` - The length of the shortest term.
* `max_length` - The length of the longest term.
* `avg_length` - The average length computed over all terms.
* `entropy` - The {wikipedia}/Entropy_(information_theory)[Shannon Entropy] value computed over all terms collected by
the aggregation. Shannon entropy quantifies the amount of information contained in the field. It is a useful metric for
measuring a wide range of properties of a data set, such as diversity, similarity, and randomness.
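
As a rough illustration of these definitions, the following Python sketch computes the same five values for an in-memory list of strings. It is not the Elasticsearch implementation: the function name `string_stats` and the sample values are hypothetical, and the sketch assumes that empty strings are skipped, that lengths are counted in characters, and that entropy is measured in bits (base-2 logarithm) over the per-character probabilities.

```python
import math
from collections import Counter


def string_stats(values):
    """Compute string_stats-style metrics for a list of strings (illustrative only)."""
    # Assumption: empty strings are skipped, mirroring "non-empty fields counted".
    terms = [v for v in values if v]
    if not terms:
        # Hypothetical convention for an empty input; not taken from the documentation.
        return {"count": 0, "min_length": None, "max_length": None,
                "avg_length": None, "entropy": 0.0}

    lengths = [len(t) for t in terms]
    total_chars = sum(lengths)

    # Count every character across all terms, then turn counts into probabilities.
    char_counts = Counter("".join(terms))
    entropy = -sum(
        (n / total_chars) * math.log2(n / total_chars)
        for n in char_counts.values()
    )

    return {
        "count": len(terms),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": total_chars / len(terms),
        "entropy": entropy,
    }


stats = string_stats(["abab", "cd"])
print(stats)
```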

For example:

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": { "string_stats": { "field": "message.keyword" } }
  }
}
--------------------------------------------------
// TEST[setup:messages]

The above aggregation computes string statistics for the `message.keyword` field in all documents. The aggregation type
is `string_stats` and the `field` parameter defines the field of the documents the stats will be computed on.
The above will return the following:

[source,console-result]
--------------------------------------------------
{
  ...

  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

The name of the aggregation (`message_stats` above) also serves as the key by which the aggregation result can be retrieved from
the returned response.
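
As a minimal sketch of that lookup, the Python snippet below parses a response body shaped like the one above with the standard `json` module and reads the result by its aggregation name. The `response_body` string is a hand-written stand-in that contains only the `aggregations` part; the elided top-level fields are left out.

```python
import json

# Hand-written stand-in for the response body above (top-level fields omitted).
response_body = """
{
  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791
    }
  }
}
"""

response = json.loads(response_body)

# The aggregation name chosen in the request is the lookup key in the response.
message_stats = response["aggregations"]["message_stats"]

print(message_stats["count"])
print(message_stats["avg_length"])
```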

==== Character distribution

The computation of the Shannon Entropy value is based on the probability of each character appearing in all terms collected
by the aggregation. To view the probability distribution for all characters, add the `show_distribution` parameter (default: `false`).

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": {
      "string_stats": {
        "field": "message.keyword",
        "show_distribution": true <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:messages]

<1> Set the `show_distribution` parameter to `true`, so that the probability distribution for all characters is returned in the results.

[source,console-result]
--------------------------------------------------
{
  ...

  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791,
      "distribution": {
        " ": 0.1527777777777778,
        "e": 0.14583333333333334,
        "s": 0.09722222222222222,
        "m": 0.08333333333333333,
        "t": 0.0763888888888889,
        "h": 0.0625,
        "a": 0.041666666666666664,
        "i": 0.041666666666666664,
        "r": 0.041666666666666664,
        "g": 0.034722222222222224,
        "n": 0.034722222222222224,
        "o": 0.034722222222222224,
        "u": 0.034722222222222224,
        "b": 0.027777777777777776,
        "w": 0.027777777777777776,
        "c": 0.013888888888888888,
        "E": 0.006944444444444444,
        "l": 0.006944444444444444,
        "1": 0.006944444444444444,
        "2": 0.006944444444444444,
        "3": 0.006944444444444444,
        "4": 0.006944444444444444,
        "y": 0.006944444444444444
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

The `distribution` object shows the probability of each character appearing in all terms. The characters are sorted by descending probability.
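
To make the calculation concrete, here is a small Python sketch that builds such a distribution for a list of strings: each character's count divided by the total number of characters, ordered by descending probability. The function name `char_distribution` and the sample input are hypothetical, and the order among equally probable characters is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter


def char_distribution(values):
    """Return {character: probability}, most probable first (illustrative only)."""
    counts = Counter("".join(values))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Counter.most_common() sorts by descending count; ties keep first-seen order.
    return {char: n / total for char, n in counts.most_common()}


distribution = char_distribution(["aab", "ab c"])
for char, probability in distribution.items():
    print(repr(char), probability)
```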

==== Script

The following example computes the message string stats based on a script:

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": {
      "string_stats": {
        "script": {
          "lang": "painless",
          "source": "doc['message.keyword'].value"
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:messages]

This will interpret the `script` parameter as an `inline` script with the `painless` script language and no script parameters.
To use a stored script, use the following syntax:

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": {
      "string_stats": {
        "script": {
          "id": "my_script",
          "params": {
            "field": "message.keyword"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:messages,stored_example_script]

===== Value script

We can use a value script to modify the message (for example, to add a prefix) and compute the new stats:

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": {
      "string_stats": {
        "field": "message.keyword",
        "script": {
          "lang": "painless",
          "source": "params.prefix + _value",
          "params": {
            "prefix": "Message: "
          }
        }
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:messages]
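
Conceptually, a value script is applied to each field value before the statistics are computed. The Python sketch below mimics the `params.prefix + _value` script from the request on a hypothetical list of messages and shows the effect on the length statistics. It is only an illustration and says nothing about how Elasticsearch evaluates Painless scripts.

```python
prefix = "Message: "  # same value as params.prefix in the request above

# Hypothetical field values, used only for illustration.
messages = ["first message", "second"]

# Equivalent of the value script: prepend the prefix to every value.
prefixed = [prefix + value for value in messages]

original_lengths = [len(m) for m in messages]
prefixed_lengths = [len(p) for p in prefixed]

# Every length grows by len(prefix), so min, max, and average shift by the same amount.
print(min(original_lengths), max(original_lengths))
print(min(prefixed_lengths), max(prefixed_lengths))
```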

==== Missing value

The `missing` parameter defines how documents that are missing a value should be treated.
By default they will be ignored, but it is also possible to treat them as if they had a value.

[source,console]
--------------------------------------------------
POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": {
      "string_stats": {
        "field": "message.keyword",
        "missing": "[empty message]" <1>
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:messages]

<1> Documents without a value in the `message.keyword` field will be treated as documents that have the value `[empty message]`.
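
The effect of `missing` can be pictured with a short Python sketch: documents that lack the field are either skipped (the default) or counted with the substitute value. The document list and the helper `collect_values` are hypothetical and only model the behavior described above.

```python
def collect_values(docs, field, missing=None):
    """Gather the values to aggregate, substituting `missing` where the field is absent."""
    values = []
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue  # default: ignore documents without a value
            value = missing
        values.append(value)
    return values


# Hypothetical documents; the second one has no message.
docs = [{"message": "hello"}, {"user": "kim"}, {"message": "goodbye"}]

ignored = collect_values(docs, "message")
substituted = collect_values(docs, "message", missing="[empty message]")

print(len(ignored), len(substituted))
```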