[DOCS] Add details about ML count functions (elastic/x-pack-elasticsearch#1335)

* [DOCS] Add details about ML count functions

* [DOCS] Address feedback in ML count functions

* [DOCS] Clarify ML population analysis in non-zero count functions

Original commit: elastic/x-pack-elasticsearch@24dbeba891
Lisa Cawley 2017-06-01 14:16:14 -07:00 committed by GitHub
parent ff6fa6790e
commit 789cd66202
2 changed files with 205 additions and 100 deletions


@ -12,36 +12,56 @@ If you are creating your job in {kib}, you specify the functions differently
depending on whether you are creating single metric, multi-metric, or advanced
jobs. For a demonstration of creating jobs in {kib}, see <<ml-getting-started>>.
//TBD: Determine what these fields are called in Kibana, for people who aren't using APIs
////
TBD: Integrate from prelert docs?:
By default, temporal (time-based) analysis is invoked, unless you also specify an
`over_field_name`, which shifts the analysis to be population- or peer-based.
When you specify `by_field_name` with a function, the analysis considers whether
there is an anomaly for one or more specific values of `by_field_name`.
NOTE: Some functions cannot be used with a `by_field_name` or `over_field_name`.
You can specify a `partition_field_name` with any function. When this is used,
the analysis is replicated for every distinct value of `partition_field_name`.
You can specify a `summary_count_field_name` with any function except metric.
When you use `summary_count_field_name`, the {ml} features expect the input
data to be pre-summarized. The value of the `summary_count_field_name` field
must contain the count of raw events that were summarized.
Some functions can benefit from overlapping buckets. This improves the overall
accuracy of the results but at the cost of a 2 bucket delay in seeing the results.
////
Most functions detect anomalies in both low and high values. In statistical
terminology, they apply a two-sided test. Some functions offer low and high
variations (for example, `count`, `low_count`, and `high_count`). These variations
apply one-sided tests, detecting anomalies only when the values are low or
high, depending on which alternative is used.
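As a rough illustration of the difference, the following sketch lists three
detector fragments that differ only in the function name. It is an assumed
minimal form rather than a complete job configuration, and a job would
typically use only the variant that matches the direction you care about:
[source,js]
--------------------------------------------------
[
  { "function" : "count" }, <1>
  { "function" : "low_count" }, <2>
  { "function" : "high_count" } <3>
]
--------------------------------------------------
<1> Two-sided: unusually low and unusually high counts are both anomalous.
<2> One-sided: only unusually low counts are anomalous.
<3> One-sided: only unusually high counts are anomalous.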
//For some functions, you can optionally specify a field name in the
//`by_field_name` property. The analysis then considers whether there is an
//anomaly for one or more specific values of that field. In {kib}, use the
//**Key Fields** field in multi-metric jobs or the **by_field_name** field in
//advanced jobs.
////
TODO: Per Sophie, "This is incorrect... Split Data refers to a partition_field_name. Over fields can only be added in Adv Config...
Can you please remove the explanations for by/over/partition fields from the documentation for analytical functions. It's a complex topic and will be easier to review in a separate exercise."
////
//For some functions, you can also optionally specify a field name in the
//`over_field_name` property. This property shifts the analysis to be population-
//or peer-based and uses the field to split the data. In {kib}, use the
//**Split Data** field in multi-metric jobs or the **over_field_name** field in
//advanced jobs.
//You can specify a `partition_field_name` with any function. The analysis is then
//segmented with completely independent baselines for each value of that field.
//In {kib}, use the **partition_field_name** field in advanced jobs.
You can specify a `summary_count_field_name` with any function except `metric`.
When you use `summary_count_field_name`, the {ml} features expect the input
data to be pre-aggregated. The value of the `summary_count_field_name` field
must contain the count of raw events that were summarized. In {kib}, use the
**summary_count_field_name** field in advanced jobs. Analyzing aggregated input data
provides a significant boost in performance.
////
TODO: Add link to aggregations topic when it is available.
////
If your data is sparse, there may be gaps in the data, which means you might have
empty buckets. You might want to treat these as anomalies or you might want these
gaps to be ignored. Your decision depends on your use case and what is important
to you. It also depends on which functions you use. The `sum` and `count`
functions are strongly affected by empty buckets. For this reason, there are
`non_null_sum` and `non_zero_count` functions, which are tolerant to sparse data.
These functions effectively ignore empty buckets.
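For instance, a minimal sketch of the sparse-tolerant variants might look like
the following. The `bytes_sent` field name is hypothetical and used only for
illustration, and the fragments are assumed detector definitions rather than a
complete job:
[source,js]
--------------------------------------------------
[
  { "function" : "non_zero_count" }, <1>
  { "function" : "non_null_sum", "field_name" : "bytes_sent" } <2>
]
--------------------------------------------------
<1> Models the count of events per bucket, but ignores empty buckets.
<2> Models the sum of the hypothetical `bytes_sent` field, but ignores empty
buckets.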
////
Some functions can benefit from overlapping buckets. This improves the overall
accuracy of the results but at the cost of a 2 bucket delay in seeing the results.
The table below provides a high-level summary of the analytical functions provided by the API. Each of the functions is described in detail over the following pages. Note the examples given in these pages use single Detector Configuration objects.
////


@ -1,13 +1,7 @@
[[ml-count-functions]]
=== Count Functions
The {xpackml} features include the following count functions:
* `count`, `high_count`, `low_count`
* `non_zero_count`, `high_non_zero_count`, `low_non_zero_count`
* `distinct_count`, `high_distinct_count`, `low_distinct_count`
Count functions detect anomalies when the count of events in a bucket is
Count functions detect anomalies when the number of events in a bucket is
anomalous.
Use `non_zero_count` functions if your data is sparse and you want to ignore
@ -19,111 +13,202 @@ in one field is unusual, as opposed to the total count.
Use high-sided functions if you want to monitor unusually high event rates.
Use low-sided functions if you want to look at drops in event rate.
The {xpackml} features include the following count functions:
////
* <<ml-count>>
* <<ml-high-count>>
* <<ml-low-count>>
* <<ml-nonzero-count>>
* <<ml-high-nonzero-count>>
* <<ml-low-nonzero-count>>
* xref:ml-count[`count`, `high_count`, `low_count`]
* xref:ml-nonzero-count[`non_zero_count`, `high_non_zero_count`, `low_non_zero_count`]
* xref:ml-distinct-count[`distinct_count`, `high_distinct_count`, `low_distinct_count`]
[float]
[[ml-count]]
===== Count
===== Count, High_count, Low_count
The `count` function detects anomalies when the count of events in a bucket is
The `count` function detects anomalies when the number of events in a bucket is
anomalous.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
The `high_count` function detects anomalies when the count of events in a
bucket is unusually high.
The `low_count` function detects anomalies when the count of events in a
bucket is unusually low.
These functions support the following properties:
* `by_field_name` (optional)
* `over_field_name` (optional)
* `partition_field_name` (optional)
For more information about those properties,
see <<ml-detectorconfig,Detector Configuration Objects>>.
.Example 1: Analyzing events with the count function
[source,js]
--------------------------------------------------
{ "function" : "count" }
--------------------------------------------------
This example is probably the simplest possible analysis! It identifies time
buckets during which the overall count of events is higher or lower than usual.
This example is probably the simplest possible analysis. It identifies
time buckets during which the overall count of events is higher or lower than
usual.
It models the event rate and detects when the event rate is unusual compared to
the past.
[float]
[[ml-high-count]]
===== High_count
The `high_count` function detects anomalies when the count of events in a
bucket is unusually high.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
When you use this function in a detector in your job, it models the event rate
and detects when the event rate is unusual compared to its past behavior.
.Example 2: Analyzing errors with the high_count function
[source,js]
--------------------------------------------------
{ "function" : "high_count", "byFieldName" : "error_code", "overFieldName": "user" }
{
"function" : "high_count",
"by_field_name" : "error_code",
"over_field_name": "user"
}
--------------------------------------------------
This example models the event rate for each error code. It detects users that
generate an unusually high count of error codes compared to other users.
If you use this `high_count` function in a detector in your job, it
models the event rate for each error code. It detects users that generate an
unusually high count of error codes compared to other users.
[float]
[[ml-low-count]]
===== Low_count
The `low_count` function detects anomalies when the count of events in a
bucket is unusually low.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
.Example 3: Analyzing status codes with the low_count function
[source,js]
--------------------------------------------------
{ "function" : "low_count", "byFieldName" : "status_code" }
{
"function" : "low_count",
"by_field_name" : "status_code"
}
--------------------------------------------------
In this example, there is a data stream that contains a field “status”. The
function detects when the count of events for a given status code is lower than
usual. It models the event rate for each status code and detects when a status
code has an unusually low count compared to its past behavior.
In this example, the function detects when the count of events for a
status code is lower than usual.
If the data stream consists of web server access log records, for example,
a drop in the count of events for a particular status code might be an indication
that something isn't working correctly.
When you use this function in a detector in your job, it models the event rate
for each status code and detects when a status code has an unusually low count
compared to its past behavior.
.Example 4: Analyzing aggregated data with the count function
[source,js]
--------------------------------------------------
{
"summary_count_field_name" : "events_per_min",
"detectors" : [
{ "function" : "count" }
]
}
--------------------------------------------------
If you are analyzing an aggregated `events_per_min` field, do not use a sum
function (for example, `sum(events_per_min)`). Instead, use the count function
and the `summary_count_field_name` property.
//TO-DO: For more information, see <<aggregations.asciidoc>>.
[float]
[[ml-nonzero-count]]
===== Non_zero_count
===== Non_zero_count, High_non_zero_count, Low_non_zero_count
The `non_zero_count` function detects anomalies when the number of events in a
bucket is anomalous, but it ignores cases where the bucket count is zero. Use
this function if you know your data is sparse or has gaps and the gaps are not
important.
The `high_non_zero_count` function detects anomalies when the number of events
in a bucket is unusually high and it ignores cases where the bucket count is
zero.
The `low_non_zero_count` function detects anomalies when the number of events in
a bucket is unusually low and it ignores cases where the bucket count is zero.
These functions support the following properties:
* `by_field_name` (optional)
* `partition_field_name` (optional)
For more information about those properties,
see <<ml-detectorconfig,Detector Configuration Objects>>.
For example, if you have the following number of events per bucket:
========================================
1,22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,43,31,0,0,0,0,0,0,0,0,0,0,0,0,2,1
========================================
The `non_zero_count` function models only the following data:
========================================
1,22,2,43,31,2,1
========================================
.Example 5: Analyzing signatures with the high_non_zero_count function
[source,js]
--------------------------------------------------
{
"function" : "high_non_zero_count",
"by_field_name" : "signaturename"
}
--------------------------------------------------
If you use this `high_non_zero_count` function in a detector in your job, it
models the count of events for the `signaturename` field. It ignores any buckets
where the count is zero and detects when a `signaturename` value has an
unusually high count of events compared to its past behavior.
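By the same pattern, a low-sided variant might look like the following sketch.
The `sensor_id` field name is hypothetical and used only for illustration:
[source,js]
--------------------------------------------------
{
  "function" : "low_non_zero_count",
  "by_field_name" : "sensor_id" <1>
}
--------------------------------------------------
<1> Hypothetical field name; substitute a field from your own data.

A detector like this would model the count of events for each `sensor_id`
value, ignore buckets where the count is zero, and flag a value only when its
count is unusually low compared to its past behavior.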
NOTE: Population analysis (using an `over_field_name` property value) is not
supported for the `non_zero_count`, `high_non_zero_count`, and
`low_non_zero_count` functions. If you want to do population analysis and your
data is sparse, use the `count` functions, which are optimized for that scenario.
non_zero_count:: count, but zeros are treated as null and ignored
[float]
[[ml-high-nonzero-count]]
===== High_non_zero_count
[[ml-distinct-count]]
===== Distinct_count, High_distinct_count, Low_distinct_count
high_non_zero_count::: count, but zeros are treated as null and ignored
The `distinct_count` function detects anomalies where the number of distinct
values in one field is unusual.
[float]
[[ml-low-nonzero-count]]
===== Low_non_zero_count
The `high_distinct_count` function detects unusually high numbers of distinct
values in one field.
low_non_zero_count::: count, but zeros are treated as null and ignored
The `low_distinct_count` function detects unusually low numbers of distinct
values in one field.
[float]
[[ml-low-count]]
===== Low_count
distinct_count:: distinct count
These functions support the following properties:
[float]
[[ml-low-count]]
===== Low_count
high_distinct_count::: distinct count
* `field_name` (required)
* `by_field_name` (optional)
* `over_field_name` (optional)
* `partition_field_name` (optional)
[float]
[[ml-low-count]]
===== Low_count
low_distinct_count::: distinct count
////
For more information about those properties,
see <<ml-detectorconfig,Detector Configuration Objects>>.
.Example 6: Analyzing users with the distinct_count function
[source,js]
--------------------------------------------------
{
"function" : "distinct_count",
"field_name" : "user"
}
--------------------------------------------------
This `distinct_count` function detects when a system has an unusual number
of logged-in users. When you use this function in a detector in your job, it
models the distinct count of users. It also detects when the distinct number of
users is unusual compared to the past.
.Example 7: Analyzing ports with the high_distinct_count function
[source,js]
--------------------------------------------------
{
"function" : "high_distinct_count",
"field_name" : "dst_port",
"over_field_name": "src_ip"
}
--------------------------------------------------
This example detects instances of port scanning. When you use this function in a
detector in your job, it models the distinct count of ports. It also detects the
`src_ip` values that connect to an unusually high number of different
`dst_port` values compared to other `src_ip` values.
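A low-sided counterpart could be sketched as follows. The `datacenter` field
name is hypothetical, and the fragment assumes the same detector properties
listed above:
[source,js]
--------------------------------------------------
{
  "function" : "low_distinct_count",
  "field_name" : "user",
  "partition_field_name" : "datacenter" <1>
}
--------------------------------------------------
<1> Hypothetical field name; substitute a field from your own data.

A detector like this would model the distinct count of `user` values separately
for each `datacenter` value and flag buckets in which that number is unusually
low, which might indicate that users cannot log in.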