[DOCS] Add ML analytical functions (elastic/x-pack-elasticsearch#1319)

* [DOCS] Add ML analytical functions

* [DOCS] Add pages for ML analytical functions

* [DOCS] Add links to ML functions from API definitions

Original commit: elastic/x-pack-elasticsearch@ae50b431d3
This commit is contained in:
Lisa Cawley 2017-05-05 10:40:17 -07:00 committed by lcawley
parent 3570eb32d3
commit 45cfc17ea1
12 changed files with 418 additions and 44 deletions


@ -10,6 +10,13 @@ apply plugin: 'elasticsearch.docs-test'
* entirely and have a party! There will be cake and everything.... */
buildRestTests.expectedUnconvertedCandidates = [
'en/ml/getting-started.asciidoc',
'en/ml/functions/count.asciidoc',
'en/ml/functions/geo.asciidoc',
'en/ml/functions/info.asciidoc',
'en/ml/functions/metric.asciidoc',
'en/ml/functions/rare.asciidoc',
'en/ml/functions/sum.asciidoc',
'en/ml/functions/time.asciidoc',
'en/rest-api/security/users.asciidoc',
'en/rest-api/security/tokens.asciidoc',
'en/rest-api/watcher/put-watch.asciidoc',


@ -0,0 +1,54 @@
[float]
[[ml-functions]]
=== Analytical Functions
The {xpackml} features include analysis functions that provide a wide variety of
flexible ways to analyze data for anomalies.
When you create jobs, you specify one or more detectors, which define the type of
analysis that needs to be done. If you are creating your job by using {ml} APIs,
you specify the functions in <<ml-detectorconfig,Detector Configuration Objects>>.
If you are creating your job in {kib}, you specify the functions differently
depending on whether you are creating single metric, multi-metric, or advanced
jobs. For a demonstration of creating jobs in {kib}, see <<ml-getting-started>>.
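For example, the following detector configuration object is a minimal sketch
that applies the `mean` function to a metric, split by a categorical field (the
`responsetime` and `airline` field names here are illustrative):

[source,js]
--------------------------------------------------
{
  "function" : "mean",
  "field_name" : "responsetime",
  "by_field_name" : "airline"
}
--------------------------------------------------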
//TBD: Determine what these fields are called in Kibana, for people who aren't using APIs
////
TBD: Integrate from prelert docs?:
By default, temporal (time-based) analysis is invoked, unless you also specify an
`over_field_name`, which shifts the analysis to be population- or peer-based.
When you specify `by_field_name` with a function, the analysis considers whether
there is an anomaly for one or more specific values of `by_field_name`.
NOTE: Some functions cannot be used with a `by_field_name` or `over_field_name`.
You can specify a `partition_field_name` with any function. When this is used,
the analysis is replicated for every distinct value of `partition_field_name`.
You can specify a `summary_count_field_name` with any function except metric.
When you use `summary_count_field_name`, the {ml} features expect the input
data to be pre-summarized. The value of the `summary_count_field_name` field
must contain the count of raw events that were summarized.
Some functions can benefit from overlapping buckets. This improves the overall
accuracy of the results but at the cost of a 2 bucket delay in seeing the results.
////
Most functions detect anomalies in both low and high values. In statistical
terminology, they apply a two-sided test. Some functions offer low and high
variations (for example, `count`, `low_count`, and `high_count`). These variations
apply one-sided tests, detecting anomalies only when the values are low or
high, depending on which alternative is used.
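For example, the following sketch contrasts the two behaviors; each line is a
separate, illustrative detector:

[source,js]
--------------------------------------------------
{ "function" : "count" }      // two-sided: flags unusually high and unusually low event rates
{ "function" : "high_count" } // one-sided: flags only unusually high event rates
--------------------------------------------------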
////
The table below provides a high-level summary of the analytical functions provided by the API. Each of the functions is described in detail over the following pages. Note the examples given in these pages use single Detector Configuration objects.
////
* <<ml-count-functions>>
* <<ml-geo-functions>>
* <<ml-info-functions>>
* <<ml-metric-functions>>
* <<ml-rare-functions>>
* <<ml-sum-functions>>
* <<ml-time-functions>>


@ -0,0 +1,129 @@
[[ml-count-functions]]
=== Count Functions
The {xpackml} features include the following count functions:
* `count`, `high_count`, `low_count`
* `non_zero_count`, `high_non_zero_count`, `low_non_zero_count`
* `distinct_count`, `high_distinct_count`, `low_distinct_count`
Count functions detect anomalies when the count of events in a bucket is
anomalous.
Use `non_zero_count` functions if your data is sparse and you want to ignore
cases where the bucket count is zero.
Use `distinct_count` functions to determine when the number of distinct values
in one field is unusual, as opposed to the total count.
Use high-sided functions if you want to monitor unusually high event rates.
Use low-sided functions if you want to look at drops in event rate.
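For example, a minimal sketch of a `distinct_count` detector that models the
number of distinct values of a hypothetical `user` field in each bucket:

[source,js]
--------------------------------------------------
{ "function" : "distinct_count", "field_name" : "user" }
--------------------------------------------------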
////
* <<ml-count>>
* <<ml-high-count>>
* <<ml-low-count>>
* <<ml-nonzero-count>>
* <<ml-high-nonzero-count>>
* <<ml-low-nonzero-count>>
* <<ml-distinct-count>>
* <<ml-high-distinct-count>>
* <<ml-low-distinct-count>>
[float]
[[ml-count]]
===== Count
The `count` function detects anomalies when the count of events in a bucket is
anomalous.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
[source,js]
--------------------------------------------------
{ "function" : "count" }
--------------------------------------------------
This example is probably the simplest possible analysis! It identifies time
buckets during which the overall count of events is higher or lower than usual.
It models the event rate and detects when the event rate is unusual compared to
the past.
[float]
[[ml-high-count]]
===== High_count
The `high_count` function detects anomalies when the count of events in a
bucket is unusually high.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
[source,js]
--------------------------------------------------
{ "function" : "high_count", "byFieldName" : "error_code", "overFieldName": "user" }
--------------------------------------------------
This example models the event rate for each error code. It detects users that
generate an unusually high count of error codes compared to other users.
[float]
[[ml-low-count]]
===== Low_count
The `low_count` function detects anomalies when the count of events in a
bucket is unusually low.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
[source,js]
--------------------------------------------------
{ "function" : "low_count", "byFieldName" : "status_code" }
--------------------------------------------------
In this example, there is a data stream that contains a `status_code` field. The
function detects when the count of events for a given status code is lower than
usual. It models the event rate for each status code and detects when a status
code has an unusually low count compared to its past behavior.
If the data stream consists of web server access log records, for example,
a drop in the count of events for a particular status code might be an indication
that something isn't working correctly.
[float]
[[ml-nonzero-count]]
===== Non_zero_count
non_zero_count:: count, but zeros are treated as null and ignored
[float]
[[ml-high-nonzero-count]]
===== High_non_zero_count
high_non_zero_count::: count, but zeros are treated as null and ignored
[float]
[[ml-low-nonzero-count]]
===== Low_non_zero_count
low_non_zero_count::: count, but zeros are treated as null and ignored
[float]
[[ml-distinct-count]]
===== Distinct_count
distinct_count:: distinct count
[float]
[[ml-high-distinct-count]]
===== High_distinct_count
high_distinct_count::: distinct count
[float]
[[ml-low-distinct-count]]
===== Low_distinct_count
low_distinct_count::: distinct count
////


@ -0,0 +1,13 @@
[[ml-geo-functions]]
=== Geographic Functions
The {xpackml} features include the following geographic functions:
* `lat_long`
The geographic functions detect anomalies in the geographic location of the
input data.
The `field_name` that you supply must be a string of the form
`latitude,longitude`. The `latitude` and `longitude` must be in the range -180
to 180 and represent a point on the surface of the Earth.
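For example, a sketch of a `lat_long` detector (the field names are
illustrative) that looks for unusual transaction locations per credit card:

[source,js]
--------------------------------------------------
{ "function" : "lat_long", "field_name" : "transaction_coordinates", "by_field_name" : "credit_card_number" }
--------------------------------------------------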


@ -0,0 +1,20 @@
[[ml-info-functions]]
=== Information Content Functions
The {xpackml} features include the following information content functions:
* `info_content`, `high_info_content`, `low_info_content`
The information content functions detect anomalies in the amount of information
that is contained in strings within a bucket. These functions can be used as
a more sophisticated method to identify incidences of data exfiltration or
command-and-control (C2) activity, when analyzing the size in bytes of the data
might not be sufficient.
If you want to monitor for unusually high amounts of information, use `high_info_content`.
If you want to look at drops in information content, use `low_info_content`.
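For example, a sketch of a `high_info_content` detector (with illustrative
field names) that could surface DNS tunneling by measuring the information
content of subdomain strings for each registered domain:

[source,js]
--------------------------------------------------
{ "function" : "high_info_content", "field_name" : "subdomain", "over_field_name" : "highest_registered_domain" }
--------------------------------------------------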
////
info_content:: information content
high_info_content::: information content
////


@ -0,0 +1,36 @@
[[ml-metric-functions]]
=== Metric Functions
The {xpackml} features include the following metric functions:
* `min`
* `max`
* `mean`, `high_mean`, `low_mean`
* `metric`
* `varp`, `high_varp`, `low_varp`
The metric functions calculate statistics, such as the mean, min, and max, for each bucket.
Field values that cannot be converted to double precision floating point numbers
are ignored.
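For example, a minimal sketch of a `metric` detector, which models the mean,
min, and max of a hypothetical `responsetime` field in a single analysis:

[source,js]
--------------------------------------------------
{ "function" : "metric", "field_name" : "responsetime" }
--------------------------------------------------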
////
metric:: all of mean, min, and max
mean:: arithmetic mean
high_mean::: arithmetic mean
low_mean::: arithmetic mean
median:: statistical median
min:: arithmetic minimum
max:: arithmetic maximum
varp:: population variance
high_varp::: ""
low_varp::: ""
////


@ -0,0 +1,34 @@
[[ml-rare-functions]]
=== Rare Functions
The {xpackml} features include the following rare functions:
* `rare`
* `freq_rare`
The rare functions detect values that occur rarely in time or rarely for a
population.
The `rare` analysis detects anomalies according to the number of distinct rare
values. This differs from `freq_rare`, which detects anomalies according to the
number of times (frequency) rare values occur.
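For example, a minimal sketch of a `rare` detector that flags values of a
hypothetical `status` field that occur rarely over time:

[source,js]
--------------------------------------------------
{ "function" : "rare", "by_field_name" : "status" }
--------------------------------------------------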
[NOTE]
====
* The `rare` and `freq_rare` functions should not be used in conjunction with
`exclude_frequent`.
* Shorter bucket spans (less than 1 hour, for example) are recommended when
looking for rare events. The functions model whether something happens in a
bucket at least once. With longer bucket spans, it is more likely that
entities will be seen in a bucket and therefore they appear less rare.
Picking the ideal bucket span depends on the characteristics of the data,
with shorter bucket spans typically being measured in minutes, not hours.
* To model rare data, a learning period of at least 20 buckets is required
for typical data.
====
////
rare:: rare items
freq_rare:: frequently rare items
////


@ -0,0 +1,26 @@
[[ml-sum-functions]]
=== Sum Functions
The {xpackml} features include the following sum functions:
* `sum`, `high_sum`, `low_sum`
* `non_null_sum`, `high_non_null_sum`, `low_non_null_sum`
The sum functions detect anomalies when the sum of a field in a bucket is anomalous.
Use high-sided functions if you want to monitor unusually high totals.
Use low-sided functions if you want to look at drops in totals.
Use `non_null_sum` functions if your data is sparse. Buckets without values will
be ignored; buckets with a zero value will be analyzed.
NOTE: Input data can contain pre-calculated fields that give the total count of
some value, for example, transactions per minute.
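For example, a sketch of a `high_sum` detector (with illustrative field names)
that looks for hosts that transfer unusually high byte volumes compared to
other hosts:

[source,js]
--------------------------------------------------
{ "function" : "high_sum", "field_name" : "cs_bytes", "over_field_name" : "cs_host" }
--------------------------------------------------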
////
TBD: Incorporate from prelert docs?:
Ensure you are familiar with our advice on Summarization of Input Data, as this is likely to provide
a more appropriate method to using the sum function.
////


@ -0,0 +1,31 @@
[[ml-time-functions]]
=== Time Functions
The {xpackml} features include the following time functions:
* `time_of_day`
* `time_of_week`
The time functions detect events that happen at unusual times of the day or of
the week. These functions can be used to find unusual patterns of behavior that
are typically associated with suspicious user activity.
[NOTE]
====
* The `time_of_day` function is not aware of the difference between days, for instance
work days and weekends. When modeling different days, use the `time_of_week` function.
In general, the `time_of_week` function is more suited to modeling the behavior of people
rather than machines, as people vary their behavior according to the day of the week.
* Shorter bucket spans (for example, 10 minutes) are recommended when performing a
`time_of_day` or `time_of_week` analysis. The time of the events being modeled are not
affected by the bucket span, but a shorter bucket span enables quicker alerting on unusual
events.
* Unusual events are flagged based on the previous pattern of the data, not on what we
might think of as unusual based on human experience. So, if events typically occur
between 3 a.m. and 5 a.m., an event occurring at 3 p.m. is flagged as unusual.
* When Daylight Saving Time starts or stops, regular events can be flagged as anomalous.
This situation occurs because the actual time of the event (as measured against a UTC
baseline) has changed. This situation is treated as a step change in behavior and the new
times will be learned quickly.
====
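For example, a minimal sketch of a `time_of_week` detector (the `eventcode`
field name is illustrative) that models when events of each code typically
occur during the week:

[source,js]
--------------------------------------------------
{ "function" : "time_of_week", "by_field_name" : "eventcode" }
--------------------------------------------------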


@ -25,51 +25,9 @@ configurations in order to get the benefits of {ml}.
Machine learning is tightly integrated with the Elastic Stack. Data is pulled
from {es} for analysis and anomaly results are displayed in {kib} dashboards.
[float]
[[ml-concepts]]
== Basic Concepts
There are a few concepts that are core to {ml} in {xpack}. Understanding these
concepts from the outset will greatly ease the learning process.
Jobs::
Machine learning jobs contain the configuration information and metadata
necessary to perform an analytics task. For a list of the properties associated
with a job, see <<ml-job-resource, Job Resources>>.
{dfeeds-cap}::
Jobs can analyze either a one-off batch of data or continuously in real time.
{dfeeds-cap} retrieve data from {es} for analysis. Alternatively you can
<<ml-post-data,POST data>> from any source directly to an API.
Detectors::
As part of the configuration information that is associated with a job,
detectors define the type of analysis that needs to be done. They also specify
which fields to analyze. You can have more than one detector in a job, which
is more efficient than running multiple jobs against the same data. For a list
of the properties associated with detectors,
see <<ml-detectorconfig, Detector Configuration Objects>>.
Buckets::
The {xpackml} features use the concept of a bucket to divide the time
series into batches for processing. The _bucket span_ is part of the
configuration information for a job. It defines the time interval that is used
to summarize and model the data. This is typically between 5 minutes and 1 hour
and it depends on your data characteristics. When you set the bucket span,
take into account the granularity at which you want to analyze, the frequency
of the input data, the typical duration of the anomalies, and the frequency at
which alerting is required.
Machine learning nodes::
A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`,
which is the default behavior. If you set `node.ml` to `false`, the node can
service API requests but it cannot run jobs. If you want to use {xpackml}
features, there must be at least one {ml} node in your cluster. For more
information about this setting, see <<ml-settings>>.
--
include::overview.asciidoc[]
include::getting-started.asciidoc[]
// include::ml-scenarios.asciidoc[]
include::api-quickref.asciidoc[]


@ -0,0 +1,65 @@
[[ml-concepts]]
== Overview
There are a few concepts that are core to {ml} in {xpack}. Understanding these
concepts from the outset will greatly ease the learning process.
[float]
[[ml-jobs]]
=== Jobs
Machine learning jobs contain the configuration information and metadata
necessary to perform an analytics task. For a list of the properties associated
with a job, see <<ml-job-resource, Job Resources>>.
[float]
[[ml-dfeeds]]
=== {dfeeds-cap}
Jobs can analyze either a one-off batch of data or continuously in real time.
{dfeeds-cap} retrieve data from {es} for analysis. Alternatively you can
<<ml-post-data,POST data>> from any source directly to an API.
[float]
[[ml-detectors]]
=== Detectors
As part of the configuration information that is associated with a job,
detectors define the type of analysis that needs to be done. They also specify
which fields to analyze. You can have more than one detector in a job, which
is more efficient than running multiple jobs against the same data. For a list
of the properties associated with detectors,
see <<ml-detectorconfig, Detector Configuration Objects>>.
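For example, a sketch of two detectors defined in a single job (the field
names are illustrative), which analyze the same data in one pass:

[source,js]
--------------------------------------------------
"detectors" : [
  { "function" : "count", "by_field_name" : "status_code" },
  { "function" : "low_sum", "field_name" : "bytes" }
]
--------------------------------------------------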
[float]
[[ml-buckets]]
=== Buckets
The {xpackml} features use the concept of a bucket to divide the time
series into batches for processing. The _bucket span_ is part of the
configuration information for a job. It defines the time interval that is used
to summarize and model the data. This is typically between 5 minutes and 1 hour
and it depends on your data characteristics. When you set the bucket span,
take into account the granularity at which you want to analyze, the frequency
of the input data, the typical duration of the anomalies, and the frequency at
which alerting is required.
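For example, a sketch of an `analysis_config` fragment that uses a 10-minute
bucket span (assuming the duration-string form of `bucket_span`):

[source,js]
--------------------------------------------------
"analysis_config" : {
  "bucket_span" : "10m",
  "detectors" : [ { "function" : "count" } ]
}
--------------------------------------------------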
[float]
[[ml-nodes]]
=== Machine learning nodes
A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`,
which is the default behavior. If you set `node.ml` to `false`, the node can
service API requests but it cannot run jobs. If you want to use {xpackml}
features, there must be at least one {ml} node in your cluster. For more
information about this setting, see <<ml-settings>>.
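For example, a sketch of `elasticsearch.yml` settings for a node that can
service API requests but does not run jobs:

[source,yaml]
--------------------------------------------------
xpack.ml.enabled: true  # ML APIs are available on this node
node.ml: false          # but ML jobs do not run here
--------------------------------------------------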
include::functions.asciidoc[]
include::functions/count.asciidoc[]
include::functions/geo.asciidoc[]
include::functions/info.asciidoc[]
include::functions/metric.asciidoc[]
include::functions/rare.asciidoc[]
include::functions/sum.asciidoc[]
include::functions/time.asciidoc[]


@ -182,7 +182,8 @@ NOTE: The `field_name` cannot contain double quotes or backslashes.
`function` (required)::
(string) The analysis function that is used.
For example, `count`, `rare`, `mean`, `min`, `max`, and `sum`. For more
information, see <<ml-functions>>.
`over_field_name`::
(string) The field used to split the data.