[DOCS] Add ML analytical functions (elastic/x-pack-elasticsearch#1319)

* [DOCS] Add ML analytical functions

* [DOCS] Add pages for ML analytical functions

* [DOCS] Add links to ML functions from API definitions

Original commit: elastic/x-pack-elasticsearch@ae50b431d3
Lisa Cawley 2017-05-05 10:40:17 -07:00 committed by lcawley
parent 3570eb32d3
commit 45cfc17ea1
12 changed files with 418 additions and 44 deletions

View File

@ -10,6 +10,13 @@ apply plugin: 'elasticsearch.docs-test'
* entirely and have a party! There will be cake and everything.... */
buildRestTests.expectedUnconvertedCandidates = [
'en/ml/getting-started.asciidoc',
'en/ml/functions/count.asciidoc',
'en/ml/functions/geo.asciidoc',
'en/ml/functions/info.asciidoc',
'en/ml/functions/metric.asciidoc',
'en/ml/functions/rare.asciidoc',
'en/ml/functions/sum.asciidoc',
'en/ml/functions/time.asciidoc',
'en/rest-api/security/users.asciidoc',
'en/rest-api/security/tokens.asciidoc',
'en/rest-api/watcher/put-watch.asciidoc',

View File

@ -0,0 +1,54 @@
[float]
[[ml-functions]]
=== Analytical Functions
The {xpackml} features include analysis functions that provide a wide variety of
flexible ways to analyze data for anomalies.
When you create jobs, you specify one or more detectors, which define the type of
analysis that needs to be done. If you are creating your job by using {ml} APIs,
you specify the functions in <<ml-detectorconfig,Detector Configuration Objects>>.
If you are creating your job in {kib}, you specify the functions differently
depending on whether you are creating single metric, multi-metric, or advanced
jobs. For a demonstration of creating jobs in {kib}, see <<ml-getting-started>>.
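For example, the following is a minimal sketch of where a function appears
within a job's `analysis_config`; the detector shown is illustrative:
[source,js]
--------------------------------------------------
{
  "analysis_config": {
    "detectors": [
      { "function" : "low_count", "by_field_name" : "status_code" }
    ]
  }
}
--------------------------------------------------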
//TBD: Determine what these fields are called in Kibana, for people who aren't using APIs
////
TBD: Integrate from prelert docs?:
By default, temporal (time-based) analysis is invoked, unless you also specify an
`over_field_name`, which shifts the analysis to be population- or peer-based.
When you specify `by_field_name` with a function, the analysis considers whether
there is an anomaly for one or more specific values of `by_field_name`.
NOTE: Some functions cannot be used with a `by_field_name` or `over_field_name`.
You can specify a `partition_field_name` with any function. When this is used,
the analysis is replicated for every distinct value of `partition_field_name`.
You can specify a `summary_count_field_name` with any function except metric.
When you use `summary_count_field_name`, the {ml} features expect the input
data to be pre-summarized. The value of the `summary_count_field_name` field
must contain the count of raw events that were summarized.
Some functions can benefit from overlapping buckets. This improves the overall
accuracy of the results but at the cost of a two-bucket delay in seeing the results.
////
Most functions detect anomalies in both low and high values. In statistical
terminology, they apply a two-sided test. Some functions offer low and high
variations (for example, `count`, `low_count`, and `high_count`). These variations
apply one-sided tests, detecting anomalies only when the values are low or
high, depending on which alternative is used.
////
The table below provides a high-level summary of the analytical functions provided by the API. Each of the functions is described in detail over the following pages. Note the examples given in these pages use single Detector Configuration objects.
////
* <<ml-count-functions>>
* <<ml-geo-functions>>
* <<ml-info-functions>>
* <<ml-metric-functions>>
* <<ml-rare-functions>>
* <<ml-sum-functions>>
* <<ml-time-functions>>

View File

@ -0,0 +1,129 @@
[[ml-count-functions]]
=== Count Functions
The {xpackml} features include the following count functions:
* `count`, `high_count`, `low_count`
* `non_zero_count`, `high_non_zero_count`, `low_non_zero_count`
* `distinct_count`, `high_distinct_count`, `low_distinct_count`
Count functions detect anomalies when the count of events in a bucket is
anomalous.
Use `non_zero_count` functions if your data is sparse and you want to ignore
cases where the bucket count is zero.
Use `distinct_count` functions to determine when the number of distinct values
in one field is unusual, as opposed to the total count.
Use high-sided functions if you want to monitor unusually high event rates.
Use low-sided functions if you want to look at drops in event rate.
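For example, a minimal sketch of a `distinct_count` detector (the `user` field
name is illustrative):
[source,js]
--------------------------------------------------
{ "function" : "distinct_count", "field_name" : "user" }
--------------------------------------------------
This detector models the number of distinct users in each bucket and detects
when it is unusual compared to the past.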
////
* <<ml-count>>
* <<ml-high-count>>
* <<ml-low-count>>
* <<ml-nonzero-count>>
* <<ml-high-nonzero-count>>
* <<ml-low-nonzero-count>>
[float]
[[ml-count]]
===== Count
The `count` function detects anomalies when the count of events in a bucket is
anomalous.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
[source,js]
--------------------------------------------------
{ "function" : "count" }
--------------------------------------------------
This example is probably the simplest possible analysis! It identifies time
buckets during which the overall count of events is higher or lower than usual.
It models the event rate and detects when the event rate is unusual compared to
the past.
[float]
[[ml-high-count]]
===== High_count
The `high_count` function detects anomalies when the count of events in a
bucket is unusually high.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
[source,js]
--------------------------------------------------
{ "function" : "high_count", "byFieldName" : "error_code", "overFieldName": "user" }
--------------------------------------------------
This example models the event rate for each error code. It detects users that
generate an unusually high count of error codes compared to other users.
[float]
[[ml-low-count]]
===== Low_count
The `low_count` function detects anomalies when the count of events in a
bucket is unusually low.
* field_name: not applicable
* by_field_name: optional
* over_field_name: optional
[source,js]
--------------------------------------------------
{ "function" : "low_count", "byFieldName" : "status_code" }
--------------------------------------------------
In this example, there is a data stream that contains a field `status_code`. The
function detects when the count of events for a given status code is lower than
usual. It models the event rate for each status code and detects when a status
code has an unusually low count compared to its past behavior.
If the data stream consists of web server access log records, for example,
a drop in the count of events for a particular status code might be an indication
that something isn't working correctly.
[float]
[[ml-nonzero-count]]
===== Non_zero_count
non_zero_count:: count, but zeros are treated as null and ignored
[float]
[[ml-high-nonzero-count]]
===== High_non_zero_count
high_non_zero_count::: count, but zeros are treated as null and ignored
[float]
[[ml-low-nonzero-count]]
===== Low_non_zero_count
low_non_zero_count::: count, but zeros are treated as null and ignored
[float]
[[ml-distinct-count]]
===== Distinct_count
distinct_count:: distinct count
[float]
[[ml-high-distinct-count]]
===== High_distinct_count
high_distinct_count::: distinct count
[float]
[[ml-low-distinct-count]]
===== Low_distinct_count
low_distinct_count::: distinct count
////

View File

@ -0,0 +1,13 @@
[[ml-geo-functions]]
=== Geographic Functions
The {xpackml} features include the following geographic functions:
* `lat_long`
The geographic functions detect anomalies in the geographic location of the
input data.
The `field_name` that you supply must be a string of the form
`latitude,longitude`. The `latitude` and `longitude` must be in the range -180
to 180 and represent a point on the surface of the Earth.
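For example, a minimal sketch of a `lat_long` detector (the field names are
illustrative):
[source,js]
--------------------------------------------------
{ "function" : "lat_long", "field_name" : "transaction_coordinates", "by_field_name" : "credit_card_number" }
--------------------------------------------------
This detector models the usual locations for each credit card and detects
transactions that occur in an unusual place.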

View File

@ -0,0 +1,20 @@
[[ml-info-functions]]
=== Information Content Functions
The {xpackml} features include the following information content functions:
* `info_content`, `high_info_content`, `low_info_content`
The information content functions detect anomalies in the amount of information
that is contained in strings within a bucket. These functions can be used as
a more sophisticated method to identify incidences of data exfiltration or
command-and-control (C2) activity, when analyzing the size in bytes of the data
might not be sufficient.
If you want to monitor for unusually high amounts of information, use `high_info_content`.
If you want to look at drops in information content, use `low_info_content`.
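For example, a minimal sketch of a `high_info_content` detector (the field
names are illustrative):
[source,js]
--------------------------------------------------
{ "function" : "high_info_content", "field_name" : "subdomain", "over_field_name" : "highest_registered_domain" }
--------------------------------------------------
This detector looks for unusually high amounts of information in the subdomain
strings observed for each registered domain, which can be an indicator of DNS
tunneling.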
////
info_content:: information content
high_info_content::: information content
////

View File

@ -0,0 +1,36 @@
[[ml-metric-functions]]
=== Metric Functions
The {xpackml} features include the following metric functions:
* `min`
* `max`
* `mean`, `high_mean`, `low_mean`
* `metric`
* `varp`, `high_varp`, `low_varp`
The metric functions include the mean, min, and max of field values. These
statistics are calculated for each bucket.
Field values that cannot be converted to double precision floating point numbers
are ignored.
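For example, a minimal sketch of a `max` detector (the field names are
illustrative):
[source,js]
--------------------------------------------------
{ "function" : "max", "field_name" : "response_time", "by_field_name" : "application" }
--------------------------------------------------
This detector models the maximum response time for each application in each
bucket and detects when the maximum is anomalous compared to its past behavior.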
////
metric:: all of mean, min, and max
mean:: arithmetic mean
high_mean::: arithmetic mean
low_mean::: arithmetic mean
median:: statistical median
min:: arithmetic minimum
max:: arithmetic maximum
varp:: population variance
high_varp::: ""
low_varp::: ""
////

View File

@ -0,0 +1,34 @@
[[ml-rare-functions]]
=== Rare Functions
The {xpackml} features include the following rare functions:
* `rare`
* `freq_rare`
The rare functions detect values that occur rarely in time or rarely for a
population.
The `rare` analysis detects anomalies according to the number of distinct rare
values. This differs from `freq_rare`, which detects anomalies according to the
number of times (frequency) rare values occur.
[NOTE]
====
* The `rare` and `freq_rare` functions should not be used in conjunction with
`exclude_frequent`.
* Shorter bucket spans (less than 1 hour, for example) are recommended when
looking for rare events. The functions model whether something happens in a
bucket at least once. With longer bucket spans, it is more likely that
entities will be seen in a bucket and therefore they appear less rare.
Picking the ideal bucket span depends on the characteristics of the data,
with shorter bucket spans typically being measured in minutes, not hours.
* To model rare data, a learning period of at least 20 buckets is required
for typical data.
====
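For example, a minimal sketch of a `rare` detector (the field name is
illustrative):
[source,js]
--------------------------------------------------
{ "function" : "rare", "by_field_name" : "status" }
--------------------------------------------------
This detector models the set of status codes that occur over time and flags
buckets in which a rarely seen status code appears.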
////
rare:: rare items
freq_rare:: frequently rare items
////

View File

@ -0,0 +1,26 @@
[[ml-sum-functions]]
=== Sum Functions
The {xpackml} features include the following sum functions:
* `sum`, `high_sum`, `low_sum`
* `non_null_sum`, `high_non_null_sum`, `low_non_null_sum`
The sum functions detect anomalies when the sum of a field in a bucket is anomalous.
Use high-sided functions if you want to monitor unusually high totals.
Use low-sided functions if you want to look at drops in totals.
Use `non_null_sum` functions if your data is sparse. Buckets without values will
be ignored; buckets with a zero value will be analyzed.
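For example, a minimal sketch of a `high_sum` detector (the field names are
illustrative):
[source,js]
--------------------------------------------------
{ "function" : "high_sum", "field_name" : "bytes", "over_field_name" : "client_ip" }
--------------------------------------------------
This detector models the total number of bytes per client and detects clients
that transfer unusually high volumes compared to the rest of the population.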
NOTE: Input data can contain pre-calculated fields that give the total count of
some value, for example, transactions per minute.
////
TBD: Incorporate from prelert docs?:
Ensure you are familiar with our advice on Summarization of Input Data, as this is likely to provide
a more appropriate method to using the sum function.
////

View File

@ -0,0 +1,31 @@
[[ml-time-functions]]
=== Time Functions
The {xpackml} features include the following time functions:
* `time_of_day`
* `time_of_week`
The time functions detect events that happen at unusual times, either of the day
or of the week. These functions can be used to find unusual patterns of behavior,
typically associated with suspicious user activity.
[NOTE]
====
* The `time_of_day` function is not aware of the difference between days, for instance
work days and weekends. When modeling different days, use the `time_of_week` function.
In general, the `time_of_week` function is more suited to modeling the behavior of people
rather than machines, as people vary their behavior according to the day of the week.
* Shorter bucket spans (for example, 10 minutes) are recommended when performing a
`time_of_day` or `time_of_week` analysis. The time of the events being modeled is not
affected by the bucket span, but a shorter bucket span enables quicker alerting on unusual
events.
* Unusual events are flagged based on the previous pattern of the data, not on what we
might think of as unusual based on human experience. So, if events typically occur
between 3 a.m. and 5 a.m., an event occurring at 3 p.m. would be flagged as unusual.
* When Daylight Saving Time starts or stops, regular events can be flagged as anomalous.
This situation occurs because the actual time of the event (as measured against a UTC
baseline) has changed. The change is treated as a step change in behavior and the new
times will be learned quickly.
====
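For example, a minimal sketch of a `time_of_day` detector (the field name is
illustrative):
[source,js]
--------------------------------------------------
{ "function" : "time_of_day", "by_field_name" : "process" }
--------------------------------------------------
This detector models when events occur throughout the day for each process and
detects activity at unusual times of day.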

View File

@ -25,51 +25,9 @@ configurations in order to get the benefits of {ml}.
Machine learning is tightly integrated with the Elastic Stack. Data is pulled
from {es} for analysis and anomaly results are displayed in {kib} dashboards.
[float]
[[ml-concepts]]
== Basic Concepts
There are a few concepts that are core to {ml} in {xpack}. Understanding these
concepts from the outset will tremendously help ease the learning process.
Jobs::
Machine learning jobs contain the configuration information and metadata
necessary to perform an analytics task. For a list of the properties associated
with a job, see <<ml-job-resource, Job Resources>>.
{dfeeds-cap}::
Jobs can analyze either a one-off batch of data or continuously in real time.
{dfeeds-cap} retrieve data from {es} for analysis. Alternatively you can
<<ml-post-data,POST data>> from any source directly to an API.
Detectors::
As part of the configuration information that is associated with a job,
detectors define the type of analysis that needs to be done. They also specify
which fields to analyze. You can have more than one detector in a job, which
is more efficient than running multiple jobs against the same data. For a list
of the properties associated with detectors,
see <<ml-detectorconfig, Detector Configuration Objects>>.
Buckets::
The {xpackml} features use the concept of a bucket to divide the time
series into batches for processing. The _bucket span_ is part of the
configuration information for a job. It defines the time interval that is used
to summarize and model the data. This is typically between 5 minutes to 1 hour
and it depends on your data characteristics. When you set the bucket span,
take into account the granularity at which you want to analyze, the frequency
of the input data, the typical duration of the anomalies, and the frequency at
which alerting is required.
Machine learning nodes::
A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`,
which is the default behavior. If you set `node.ml` to `false`, the node can
service API requests but it cannot run jobs. If you want to use {xpackml}
features, there must be at least one {ml} node in your cluster. For more
information about this setting, see <<ml-settings>>.
--
include::overview.asciidoc[]
include::getting-started.asciidoc[]
// include::ml-scenarios.asciidoc[]
include::api-quickref.asciidoc[]

View File

@ -0,0 +1,65 @@
[[ml-concepts]]
== Overview
There are a few concepts that are core to {ml} in {xpack}. Understanding these
concepts from the outset will greatly ease the learning process.
[float]
[[ml-jobs]]
=== Jobs
Machine learning jobs contain the configuration information and metadata
necessary to perform an analytics task. For a list of the properties associated
with a job, see <<ml-job-resource, Job Resources>>.
[float]
[[ml-dfeeds]]
=== {dfeeds-cap}
Jobs can analyze either a one-off batch of data or continuously in real time.
{dfeeds-cap} retrieve data from {es} for analysis. Alternatively you can
<<ml-post-data,POST data>> from any source directly to an API.
[float]
[[ml-detectors]]
=== Detectors
As part of the configuration information that is associated with a job,
detectors define the type of analysis that needs to be done. They also specify
which fields to analyze. You can have more than one detector in a job, which
is more efficient than running multiple jobs against the same data. For a list
of the properties associated with detectors,
see <<ml-detectorconfig, Detector Configuration Objects>>.
[float]
[[ml-buckets]]
=== Buckets
The {xpackml} features use the concept of a bucket to divide the time
series into batches for processing. The _bucket span_ is part of the
configuration information for a job. It defines the time interval that is used
to summarize and model the data. This is typically between 5 minutes and 1 hour
and it depends on your data characteristics. When you set the bucket span,
take into account the granularity at which you want to analyze, the frequency
of the input data, the typical duration of the anomalies, and the frequency at
which alerting is required.
[float]
[[ml-nodes]]
=== Machine learning nodes
A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`,
which is the default behavior. If you set `node.ml` to `false`, the node can
service API requests but it cannot run jobs. If you want to use {xpackml}
features, there must be at least one {ml} node in your cluster. For more
information about this setting, see <<ml-settings>>.
include::functions.asciidoc[]
include::functions/count.asciidoc[]
include::functions/geo.asciidoc[]
include::functions/info.asciidoc[]
include::functions/metric.asciidoc[]
include::functions/rare.asciidoc[]
include::functions/sum.asciidoc[]
include::functions/time.asciidoc[]

View File

@ -182,7 +182,8 @@ NOTE: The `field_name` cannot contain double quotes or backslashes.
`function` (required)::
(string) The analysis function that is used.
For example, `count`, `rare`, `mean`, `min`, `max`, and `sum`. For more
information, see <<ml-functions>>.
`over_field_name`::
(string) The field used to split the data.