From 45cfc17ea11e1fb601d752b40195020f8d439773 Mon Sep 17 00:00:00 2001
From: Lisa Cawley
Date: Fri, 5 May 2017 10:40:17 -0700
Subject: [PATCH] [DOCS] Add ML analytical functions (elastic/x-pack-elasticsearch#1319)

* [DOCS] Add ML analytical functions

* [DOCS] Add pages for ML analytical functions

* [DOCS] Add links to ML functions from API definitions

Original commit: elastic/x-pack-elasticsearch@ae50b431d3e2ca6cfb61fde014cca2ae9fa0024f
---
 docs/build.gradle                        |   7 ++
 docs/en/ml/functions.asciidoc            |  54 ++++++++++
 docs/en/ml/functions/count.asciidoc      | 129 +++++++++++++++++++++++
 docs/en/ml/functions/geo.asciidoc        |  13 +++
 docs/en/ml/functions/info.asciidoc       |  20 ++++
 docs/en/ml/functions/metric.asciidoc     |  36 +++++++
 docs/en/ml/functions/rare.asciidoc       |  34 ++++++
 docs/en/ml/functions/sum.asciidoc        |  26 +++++
 docs/en/ml/functions/time.asciidoc       |  31 ++++++
 docs/en/ml/index.asciidoc                |  44 +-------
 docs/en/ml/overview.asciidoc             |  65 ++++++++++++
 docs/en/rest-api/ml/jobresource.asciidoc |   3 +-
 12 files changed, 418 insertions(+), 44 deletions(-)
 create mode 100644 docs/en/ml/functions.asciidoc
 create mode 100644 docs/en/ml/functions/count.asciidoc
 create mode 100644 docs/en/ml/functions/geo.asciidoc
 create mode 100644 docs/en/ml/functions/info.asciidoc
 create mode 100644 docs/en/ml/functions/metric.asciidoc
 create mode 100644 docs/en/ml/functions/rare.asciidoc
 create mode 100644 docs/en/ml/functions/sum.asciidoc
 create mode 100644 docs/en/ml/functions/time.asciidoc
 create mode 100644 docs/en/ml/overview.asciidoc

diff --git a/docs/build.gradle b/docs/build.gradle
index e4fef5f3e93..0f1a0a04237 100644
--- a/docs/build.gradle
+++ b/docs/build.gradle
@@ -10,6 +10,13 @@ apply plugin: 'elasticsearch.docs-test'
  * entirely and have a party! There will be cake and everything.... */
 buildRestTests.expectedUnconvertedCandidates = [
   'en/ml/getting-started.asciidoc',
+  'en/ml/functions/count.asciidoc',
+  'en/ml/functions/geo.asciidoc',
+  'en/ml/functions/info.asciidoc',
+  'en/ml/functions/metric.asciidoc',
+  'en/ml/functions/rare.asciidoc',
+  'en/ml/functions/sum.asciidoc',
+  'en/ml/functions/time.asciidoc',
   'en/rest-api/security/users.asciidoc',
   'en/rest-api/security/tokens.asciidoc',
   'en/rest-api/watcher/put-watch.asciidoc',
diff --git a/docs/en/ml/functions.asciidoc b/docs/en/ml/functions.asciidoc
new file mode 100644
index 00000000000..548b9c7e0f8
--- /dev/null
+++ b/docs/en/ml/functions.asciidoc
@@ -0,0 +1,54 @@
+[float]
+[[ml-functions]]
+=== Analytical Functions
+
+The {xpackml} features include analysis functions that provide a wide variety
+of flexible ways to analyze data for anomalies.
+
+When you create jobs, you specify one or more detectors, which define the type
+of analysis that needs to be done. If you are creating your job by using {ml}
+APIs, you specify the functions in <>.
+If you are creating your job in {kib}, you specify the functions differently
+depending on whether you are creating single metric, multi-metric, or advanced
+jobs. For a demonstration of creating jobs in {kib}, see <>.
+
+//TBD: Determine what these fields are called in Kibana, for people who aren't using APIs
+////
+TBD: Integrate from prelert docs?:
+By default, temporal (time-based) analysis is invoked, unless you also specify
+an `over_field_name`, which shifts the analysis to be population- or peer-based.
+
+When you specify `by_field_name` with a function, the analysis considers
+whether there is an anomaly for one or more specific values of `by_field_name`.
+
+NOTE: Some functions cannot be used with a `by_field_name` or `over_field_name`.
+
+You can specify a `partition_field_name` with any function. When this is used,
+the analysis is replicated for every distinct value of `partition_field_name`.
+
+You can specify a `summary_count_field_name` with any function except `metric`.
+When you use `summary_count_field_name`, the {ml} features expect the input
+data to be pre-summarized. The value of the `summary_count_field_name` field
+must contain the count of raw events that were summarized.
+
+Some functions can benefit from overlapping buckets. This improves the overall
+accuracy of the results but at the cost of a two-bucket delay in seeing the
+results.
+////
+
+Most functions detect anomalies in both low and high values. In statistical
+terminology, they apply a two-sided test. Some functions offer low and high
+variations (for example, `count`, `low_count`, and `high_count`). These
+variations apply one-sided tests, detecting anomalies only when the values are
+low or high, depending on which alternative is used.
+
+////
+The table below provides a high-level summary of the analytical functions
+provided by the API. Each of the functions is described in detail over the
+following pages. Note that the examples given in these pages use single
+detector configuration objects.
+////
+
+* <>
+* <>
+* <>
+* <>
+* <>
+* <>
+* <>
diff --git a/docs/en/ml/functions/count.asciidoc b/docs/en/ml/functions/count.asciidoc
new file mode 100644
index 00000000000..81074739349
--- /dev/null
+++ b/docs/en/ml/functions/count.asciidoc
@@ -0,0 +1,129 @@
+[[ml-count-functions]]
+=== Count Functions
+
+The {xpackml} features include the following count functions:
+
+* `count`, `high_count`, `low_count`
+* `non_zero_count`, `high_non_zero_count`, `low_non_zero_count`
+* `distinct_count`, `high_distinct_count`, `low_distinct_count`
+
+Count functions detect anomalies when the count of events in a bucket is
+anomalous.
+
+Use `non_zero_count` functions if your data is sparse and you want to ignore
+cases where the bucket count is zero.
+
+Use `distinct_count` functions to determine when the number of distinct values
+in one field is unusual, as opposed to the total count.
+
+Use high-sided functions if you want to monitor unusually high event rates.
+Use low-sided functions if you want to look at drops in event rate.
+
+////
+* <>
+* <>
+* <>
+* <>
+* <>
+* <>
+
+[float]
+[[ml-count]]
+===== Count
+
+The `count` function detects anomalies when the count of events in a bucket is
+anomalous.
+
+* field_name: not applicable
+* by_field_name: optional
+* over_field_name: optional
+
+[source,js]
+--------------------------------------------------
+{ "function" : "count" }
+--------------------------------------------------
+
+This example is probably the simplest possible analysis! It identifies time
+buckets during which the overall count of events is higher or lower than usual.
+
+It models the event rate and detects when the event rate is unusual compared
+to the past.
+
+[float]
+[[ml-high-count]]
+===== High_count
+
+The `high_count` function detects anomalies when the count of events in a
+bucket is unusually high.
+
+* field_name: not applicable
+* by_field_name: optional
+* over_field_name: optional
+
+[source,js]
+--------------------------------------------------
+{ "function" : "high_count", "by_field_name" : "error_code", "over_field_name" : "user" }
+--------------------------------------------------
+
+This example models the event rate for each error code. It detects users that
+generate an unusually high count of error codes compared to other users.
+
+[float]
+[[ml-low-count]]
+===== Low_count
+
+The `low_count` function detects anomalies when the count of events in a
+bucket is unusually low.
+
+* field_name: not applicable
+* by_field_name: optional
+* over_field_name: optional
+
+[source,js]
+--------------------------------------------------
+{ "function" : "low_count", "by_field_name" : "status_code" }
+--------------------------------------------------
+
+In this example, there is a data stream that contains a field `status_code`.
+The function detects when the count of events for a given status code is lower
+than usual. It models the event rate for each status code and detects when a
+status code has an unusually low count compared to its past behavior.
+
+If the data stream consists of web server access log records, for example, a
+drop in the count of events for a particular status code might be an
+indication that something isn't working correctly.
+
+[float]
+[[ml-nonzero-count]]
+===== Non_zero_count
+
+non_zero_count:: count, but zeros are treated as null and ignored
+
+[float]
+[[ml-high-nonzero-count]]
+===== High_non_zero_count
+
+high_non_zero_count::: count, but zeros are treated as null and ignored
+
+[float]
+[[ml-low-nonzero-count]]
+===== Low_non_zero_count
+
+low_non_zero_count::: count, but zeros are treated as null and ignored
+
+[float]
+[[ml-distinct-count]]
+===== Distinct_count
+
+distinct_count:: distinct count
+
+[float]
+[[ml-high-distinct-count]]
+===== High_distinct_count
+
+high_distinct_count::: distinct count
+
+[float]
+[[ml-low-distinct-count]]
+===== Low_distinct_count
+
+low_distinct_count::: distinct count
+////
diff --git a/docs/en/ml/functions/geo.asciidoc b/docs/en/ml/functions/geo.asciidoc
new file mode 100644
index 00000000000..0a4854fd99b
--- /dev/null
+++ b/docs/en/ml/functions/geo.asciidoc
@@ -0,0 +1,13 @@
+[[ml-geo-functions]]
+=== Geographic Functions
+
+The {xpackml} features include the following geographic functions:
+
+* `lat_long`
+
+The geographic functions detect anomalies in the geographic location of the
+input data.
+
+The `field_name` that you supply must be a string of the form
+`latitude,longitude`. The `latitude` and `longitude` must be in the range -180
+to 180 and represent a point on the surface of the Earth.
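+
+For example, the following detector configuration is a minimal sketch that
+uses the `lat_long` function to detect anomalies in the locations of credit
+card transactions. The `transaction_coordinates` and `credit_card_number`
+field names are hypothetical; substitute fields from your own data:
+
+[source,js]
+--------------------------------------------------
+{ "function" : "lat_long", "field_name" : "transaction_coordinates", "by_field_name" : "credit_card_number" }
+--------------------------------------------------
+
+This sketch models the typical transaction locations for each credit card and
+flags transactions that occur at an unusual distance from those locations.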
diff --git a/docs/en/ml/functions/info.asciidoc b/docs/en/ml/functions/info.asciidoc
new file mode 100644
index 00000000000..f084d6c09f9
--- /dev/null
+++ b/docs/en/ml/functions/info.asciidoc
@@ -0,0 +1,20 @@
+[[ml-info-functions]]
+=== Information Content Functions
+
+The {xpackml} features include the following information content functions:
+
+* `info_content`, `high_info_content`, `low_info_content`
+
+The information content functions detect anomalies in the amount of
+information that is contained in strings within a bucket. These functions can
+be used as a more sophisticated method to identify incidences of data
+exfiltration or C2C activity, when analyzing the size in bytes of the data
+might not be sufficient.
+
+If you want to monitor for unusually high amounts of information, use
+`high_info_content`. If you want to look at drops in information content, use
+`low_info_content`.
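+
+For example, the following detector configuration is a minimal sketch that
+uses `high_info_content` to find unusually high amounts of information in DNS
+subdomain strings. The `subdomain` and `highest_registered_domain` field
+names are hypothetical; substitute fields from your own data:
+
+[source,js]
+--------------------------------------------------
+{ "function" : "high_info_content", "field_name" : "subdomain", "over_field_name" : "highest_registered_domain" }
+--------------------------------------------------
+
+In this sketch, DNS tunneling attempts could show up as subdomains that carry
+unusually large amounts of encoded information.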
+
+////
+info_content:: information content
+
+high_info_content::: information content
+////
diff --git a/docs/en/ml/functions/metric.asciidoc b/docs/en/ml/functions/metric.asciidoc
new file mode 100644
index 00000000000..7ea7c2baa8b
--- /dev/null
+++ b/docs/en/ml/functions/metric.asciidoc
@@ -0,0 +1,36 @@
+[[ml-metric-functions]]
+=== Metric Functions
+
+The {xpackml} features include the following metric functions:
+
+* `min`
+* `max`
+* `mean`, `high_mean`, `low_mean`
+* `metric`
+* `varp`, `high_varp`, `low_varp`
+
+The metric functions include mean, min, and max. These values are calculated
+for each bucket. Field values that cannot be converted to double precision
+floating point numbers are ignored.
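+
+For example, the following detector configuration is a minimal sketch that
+uses the `metric` function, which combines `mean`, `min`, and `max`, to
+analyze response times. The `responsetime` and `airline` field names are
+hypothetical; substitute fields from your own data:
+
+[source,js]
+--------------------------------------------------
+{ "function" : "metric", "field_name" : "responsetime", "by_field_name" : "airline" }
+--------------------------------------------------
+
+This sketch detects when the mean, minimum, or maximum response time is
+unusual for any individual airline.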
+
+////
+metric:: all of mean, min, and max
+
+mean:: arithmetic mean
+
+high_mean::: arithmetic mean
+
+low_mean::: arithmetic mean
+
+median:: statistical median
+
+min:: arithmetic minimum
+
+max:: arithmetic maximum
+
+varp:: population variance
+
+high_varp::: ""
+
+low_varp::: ""
+////
diff --git a/docs/en/ml/functions/rare.asciidoc b/docs/en/ml/functions/rare.asciidoc
new file mode 100644
index 00000000000..362e2e67171
--- /dev/null
+++ b/docs/en/ml/functions/rare.asciidoc
@@ -0,0 +1,34 @@
+[[ml-rare-functions]]
+=== Rare Functions
+
+The {xpackml} features include the following rare functions:
+
+* `rare`
+* `freq_rare`
+
+The rare functions detect values that occur rarely in time or rarely for a
+population.
+
+The `rare` analysis detects anomalies according to the number of distinct rare
+values. This differs from `freq_rare`, which detects anomalies according to
+the number of times (frequency) that rare values occur.
+
+[NOTE]
+====
+* The `rare` and `freq_rare` functions should not be used in conjunction with
+`exclude_frequent`.
+* Shorter bucket spans (less than 1 hour, for example) are recommended when
+looking for rare events. The functions model whether something happens in a
+bucket at least once. With longer bucket spans, it is more likely that
+entities will be seen in a bucket and therefore they appear less rare.
+Picking the ideal bucket span depends on the characteristics of the data,
+with shorter bucket spans typically being measured in minutes, not hours.
+* To model rare data, a learning period of at least 20 buckets is required
+for typical data.
+====
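+
+For example, the following detector configuration is a minimal sketch that
+uses the `rare` function to detect HTTP status codes that occur rarely over
+time. The `status` field name is hypothetical; substitute a field from your
+own data:
+
+[source,js]
+--------------------------------------------------
+{ "function" : "rare", "by_field_name" : "status" }
+--------------------------------------------------
+
+In web server access logs, for instance, this sketch would flag buckets in
+which rarely seen status codes appear.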
+
+////
+rare:: rare items
+
+freq_rare:: frequently rare items
+////
diff --git a/docs/en/ml/functions/sum.asciidoc b/docs/en/ml/functions/sum.asciidoc
new file mode 100644
index 00000000000..d587aa31839
--- /dev/null
+++ b/docs/en/ml/functions/sum.asciidoc
@@ -0,0 +1,26 @@
+[[ml-sum-functions]]
+=== Sum Functions
+
+The {xpackml} features include the following sum functions:
+
+* `sum`, `high_sum`, `low_sum`
+* `non_null_sum`, `high_non_null_sum`, `low_non_null_sum`
+
+The sum functions detect anomalies when the sum of a field in a bucket is
+anomalous.
+
+Use high-sided functions if you want to monitor unusually high totals.
+
+Use low-sided functions if you want to look at drops in totals.
+
+Use `non_null_sum` functions if your data is sparse. Buckets without values
+will be ignored; buckets with a zero value will be analyzed.
+
+NOTE: Input data can contain pre-calculated fields that give the total count
+of some value, for example, transactions per minute.
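+
+For example, the following detector configuration is a minimal sketch that
+uses `high_sum` to find users that send unusually high volumes of data. The
+`bytes_sent` and `username` field names are hypothetical; substitute fields
+from your own data:
+
+[source,js]
+--------------------------------------------------
+{ "function" : "high_sum", "field_name" : "bytes_sent", "over_field_name" : "username" }
+--------------------------------------------------
+
+Because this sketch uses an `over_field_name`, the analysis is population
+based: it flags users whose total is unusually high compared to their peers.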
+
+////
+TBD: Incorporate from prelert docs?:
+Ensure you are familiar with our advice on Summarization of Input Data, as
+this is likely to provide a more appropriate method than using the sum
+function.
+////
diff --git a/docs/en/ml/functions/time.asciidoc b/docs/en/ml/functions/time.asciidoc
new file mode 100644
index 00000000000..cd3f9f7ec20
--- /dev/null
+++ b/docs/en/ml/functions/time.asciidoc
@@ -0,0 +1,31 @@
+[[ml-time-functions]]
+=== Time Functions
+
+The {xpackml} features include the following time functions:
+
+* `time_of_day`
+* `time_of_week`
+
+The time functions detect events that happen at unusual times, either of the
+day or of the week. These functions can be used to find unusual patterns of
+behavior, typically associated with suspicious user activity.
+
+[NOTE]
+====
+* The `time_of_day` function is not aware of the difference between days, for
+instance, work days and weekends. When modeling different days, use the
+`time_of_week` function. In general, the `time_of_week` function is more
+suited to modeling the behavior of people rather than machines, as people vary
+their behavior according to the day of the week.
+* Shorter bucket spans (for example, 10 minutes) are recommended when
+performing a `time_of_day` or `time_of_week` analysis. The time of the events
+being modeled is not affected by the bucket span, but a shorter bucket span
+enables quicker alerting on unusual events.
+* Unusual events are flagged based on the previous pattern of the data, not on
+what we might think of as unusual based on human experience. So, if events
+typically occur between 3 a.m. and 5 a.m., an event occurring at 3 p.m. would
+be flagged as unusual.
+* When Daylight Saving Time starts or stops, regular events can be flagged as
+anomalous. This situation occurs because the actual time of the event (as
+measured against a UTC baseline) has changed. This situation is treated as a
+step change in behavior and the new times will be learned quickly.
+====
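+
+For example, the following detector configuration is a minimal sketch that
+uses the `time_of_day` function to find processes that are started at unusual
+times of day. The `process_name` field name is hypothetical; substitute a
+field from your own data:
+
+[source,js]
+--------------------------------------------------
+{ "function" : "time_of_day", "by_field_name" : "process_name" }
+--------------------------------------------------
+
+This sketch models when events occur throughout the day for each process and
+detects when an event occurs at a time that is unusual for that process.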
diff --git a/docs/en/ml/index.asciidoc b/docs/en/ml/index.asciidoc
index 9a7a1227205..7ff83513cac 100644
--- a/docs/en/ml/index.asciidoc
+++ b/docs/en/ml/index.asciidoc
@@ -25,51 +25,9 @@ configurations in order to get the benefits of {ml}.
 Machine learning is tightly integrated with the Elastic Stack. Data is pulled
 from {es} for analysis and anomaly results are displayed in {kib} dashboards.
 
-[float]
-[[ml-concepts]]
-== Basic Concepts
-
-There are a few concepts that are core to {ml} in {xpack}. Understanding these
-concepts from the outset will tremendously help ease the learning process.
-
-Jobs::
-Machine learning jobs contain the configuration information and metadata
-necessary to perform an analytics task. For a list of the properties
-associated with a job, see <>.
-
-{dfeeds-cap}::
-Jobs can analyze either a one-off batch of data or continuously in real time.
-{dfeeds-cap} retrieve data from {es} for analysis. Alternatively you can
-<> from any source directly to an API.
-
-Detectors::
-As part of the configuration information that is associated with a job,
-detectors define the type of analysis that needs to be done. They also specify
-which fields to analyze. You can have more than one detector in a job, which
-is more efficient than running multiple jobs against the same data. For a list
-of the properties associated with detectors, see <>.
-
-Buckets::
-The {xpackml} features use the concept of a bucket to divide the time series
-into batches for processing. The _bucket span_ is part of the configuration
-information for a job. It defines the time interval that is used to summarize
-and model the data. This is typically between 5 minutes to 1 hour and it
-depends on your data characteristics. When you set the bucket span, take into
-account the granularity at which you want to analyze, the frequency of the
-input data, the typical duration of the anomalies, and the frequency at which
-alerting is required.
-
-Machine learning nodes::
-A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`,
-which is the default behavior. If you set `node.ml` to `false`, the node can
-service API requests but it cannot run jobs. If you want to use {xpackml}
-features, there must be at least one {ml} node in your cluster. For more
-information about this setting, see <>.
-
 --
 
+include::overview.asciidoc[]
 include::getting-started.asciidoc[]
 // include::ml-scenarios.asciidoc[]
 include::api-quickref.asciidoc[]
diff --git a/docs/en/ml/overview.asciidoc b/docs/en/ml/overview.asciidoc
new file mode 100644
index 00000000000..1741fd1aa60
--- /dev/null
+++ b/docs/en/ml/overview.asciidoc
@@ -0,0 +1,65 @@
+[[ml-concepts]]
+== Overview
+
+There are a few concepts that are core to {ml} in {xpack}. Understanding these
+concepts from the outset will greatly ease the learning process.
+
+[float]
+[[ml-jobs]]
+=== Jobs
+
+Machine learning jobs contain the configuration information and metadata
+necessary to perform an analytics task. For a list of the properties
+associated with a job, see <>.
+
+[float]
+[[ml-dfeeds]]
+=== {dfeeds-cap}
+
+Jobs can analyze either a one-off batch of data or continuously in real time.
+{dfeeds-cap} retrieve data from {es} for analysis. Alternatively, you can
+<> from any source directly to an API.
+
+[float]
+[[ml-detectors]]
+=== Detectors
+
+As part of the configuration information that is associated with a job,
+detectors define the type of analysis that needs to be done. They also specify
+which fields to analyze. You can have more than one detector in a job, which
+is more efficient than running multiple jobs against the same data. For a list
+of the properties associated with detectors, see <>.
+
+[float]
+[[ml-buckets]]
+=== Buckets
+
+The {xpackml} features use the concept of a bucket to divide the time series
+into batches for processing. The _bucket span_ is part of the configuration
+information for a job. It defines the time interval that is used to summarize
+and model the data. This is typically between 5 minutes and 1 hour, and it
+depends on your data characteristics. When you set the bucket span, take into
+account the granularity at which you want to analyze, the frequency of the
+input data, the typical duration of the anomalies, and the frequency at which
+alerting is required.
+
+[float]
+[[ml-nodes]]
+=== Machine learning nodes
+
+A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`,
+which is the default behavior. If you set `node.ml` to `false`, the node can
+service API requests but it cannot run jobs. If you want to use {xpackml}
+features, there must be at least one {ml} node in your cluster. For more
+information about this setting, see <>.
+
+include::functions.asciidoc[]
+
+include::functions/count.asciidoc[]
+include::functions/geo.asciidoc[]
+include::functions/info.asciidoc[]
+include::functions/metric.asciidoc[]
+include::functions/rare.asciidoc[]
+include::functions/sum.asciidoc[]
+include::functions/time.asciidoc[]
diff --git a/docs/en/rest-api/ml/jobresource.asciidoc b/docs/en/rest-api/ml/jobresource.asciidoc
index 38ee2ec0cd5..aedc13e9b91 100644
--- a/docs/en/rest-api/ml/jobresource.asciidoc
+++ b/docs/en/rest-api/ml/jobresource.asciidoc
@@ -182,7 +182,8 @@ NOTE: The `field_name` cannot contain double quotes or backslashes.
 
 `function` (required)::
   (string) The analysis function that is used.
-  For example, `count`, `rare`, `mean`, `min`, `max`, and `sum`.
+  For example, `count`, `rare`, `mean`, `min`, `max`, and `sum`. For more
+  information, see <>.
 
 `over_field_name`::
   (string) The field used to split the data.
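+
+For example, a detector that combines a `function` with an `over_field_name`
+might look like the following sketch. The `error_code` and `user` field names
+are hypothetical; substitute fields from your own data:
+
+[source,js]
+--------------------------------------------------
+{ "function" : "high_count", "by_field_name" : "error_code", "over_field_name" : "user" }
+--------------------------------------------------
+
+This sketch models the event rate for each error code and detects users that
+generate an unusually high count of error codes compared to other users.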