[DOCS] Add details about ML count functions (elastic/x-pack-elasticsearch#1335)

* [DOCS] Add details about ML count functions
* [DOCS] Address feedback in ML count functions
* [DOCS] Clarify ML population analysis in non-zero count functions

Original commit: elastic/x-pack-elasticsearch@24dbeba891
2017-06-01 14:16:14 -07:00 · 2017-06-01 14:16:14 -07:00 · 789cd66202
parent ff6fa6790e
commit 789cd66202
2 changed files with 205 additions and 100 deletions
--- a/docs/en/ml/functions.asciidoc
+++ b/docs/en/ml/functions.asciidoc
@@ -12,36 +12,56 @@ If you are creating your job in {kib}, you specify the functions differently
 depending on whether you are creating single metric, multi-metric, or advanced
 jobs. For a demonstration of creating jobs in {kib}, see <<ml-getting-started>>.

-//TBD: Determine what these fields are called in Kibana, for people who aren't using APIs
-////
-TBD: Integrate from prelert docs?:
-By default, temporal (time-based) analysis is invoked, unless you also specify an
-`over_field_name`, which shifts the analysis to be population- or peer-based.
-
-When you specify `by_field_name` with a function, the analysis considers whether
-there is an anomaly for one of more specific values of `by_field_name`.
-
-NOTE: Some functions cannot be used with a `by_field_name` or `over_field_name`.
-
-You can specify a `partition_field_name` with any function. When this is used,
-the analysis is replicated for every distinct value of `partition_field_name`.
-
-You can specify a `summary_count_field_name` with any function except metric.
-When you use `summary_count_field_name`, the {ml} features expect the input
-data to be pre-summarized. The value of the `summary_count_field_name` field
-must contain the count of raw events that were summarized.
-
-Some functions can benefit from overlapping buckets. This improves the overall
-accuracy of the results but at the cost of a 2 bucket delay in seeing the results.
-////
-
 Most functions detect anomalies in both low and high values. In statistical
 terminology, they apply a two-sided test. Some functions offer low and high
 variations (for example, `count`, `low_count`, and `high_count`). These variations
 apply one-sided tests, detecting anomalies only when the values are low or
 high, depending on which alternative is used.

+//For some functions, you can optionally specify a field name in the
+//`by_field_name` property. The analysis then considers whether there is an
+//anomaly for one or more specific values of that field. In {kib}, use the
+//**Key Fields** field in multi-metric jobs or the **by_field_name** field in
+//advanced jobs.
 ////
+TODO: Per Sophie, "This is incorrect... Split Data refers to a partition_field_name. Over fields can only be added in Adv Config...
+
+Can you please remove the explanations for by/over/partition fields from the documentation for analytical functions. It's a complex topic and will be easier to review in a separate exercise."
+////
+
+//For some functions, you can also optionally specify a field name in the
+//`over_field_name` property. This property shifts the analysis to be population-
+//or peer-based and uses the field to split the data.  In {kib}, use the
+//**Split Data** field in multi-metric jobs or the **over_field_name** field in
+//advanced jobs.
+
+//You can specify a `partition_field_name` with any function. The analysis is then
+//segmented with completely independent baselines for each value of that field.
+//In {kib}, use the **partition_field_name** field in advanced jobs.
+
+You can specify a `summary_count_field_name` with any function except `metric`.
+When you use `summary_count_field_name`, the {ml} features expect the input
+data to be pre-aggregated. The value of the `summary_count_field_name` field
+must contain the count of raw events that were summarized. In {kib}, use the
+**summary_count_field_name** in advanced jobs. Analyzing aggregated input data
+provides a significant boost in performance.
+
+////
+TODO: Add link to aggregations topic when it is available.
+////
+
+If your data is sparse, there may be gaps in the data, which means you might have
+empty buckets. You might want to treat these as anomalies or you might want these
+gaps to be ignored. Your decision depends on your use case and what is important
+to you. It also depends on which functions you use. The `sum` and `count`
+functions are strongly affected by empty buckets. For this reason, there are
+`non_null_sum` and `non_zero_count` functions, which are tolerant to sparse data.
+These functions effectively ignore empty buckets.
+
+////
+Some functions can benefit from overlapping buckets. This improves the overall
+accuracy of the results but at the cost of a 2 bucket delay in seeing the results.
+
 The table below provides a high-level summary of the analytical functions provided by the API. Each of the functions is described in detail over the following pages. Note the examples given in these pages use single Detector Configuration objects.
 ////
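
The effect of empty buckets on count-style functions, described in the hunk above, can be sketched in Python. This is only an illustration of the idea, not how the {ml} features are implemented: `bucket_counts` and `non_zero` are hypothetical helper names, and the timestamps and the 60-second bucket span are made-up values.

```python
from collections import Counter


def bucket_counts(timestamps, bucket_span, start, end):
    """Count events per fixed-width bucket, keeping empty buckets as 0."""
    counts = Counter(
        (t - start) // bucket_span for t in timestamps if start <= t < end
    )
    n_buckets = -(-(end - start) // bucket_span)  # ceiling division
    return [counts.get(i, 0) for i in range(n_buckets)]


def non_zero(counts):
    """Drop empty buckets, mimicking what a non_zero_count-style view keeps."""
    return [c for c in counts if c != 0]


# Hypothetical event times in seconds; buckets 2 and 3 receive no events.
events = [0, 10, 65, 250, 255, 290]
counts = bucket_counts(events, bucket_span=60, start=0, end=300)

print(counts)            # a count-style view sees the two empty buckets
print(non_zero(counts))  # a non_zero_count-style view ignores them
```

A plain count sees the zeros and may treat the gap as a drop in event rate, whereas the filtered series contains only the buckets in which something happened.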

--- a/docs/en/ml/functions/count.asciidoc
+++ b/docs/en/ml/functions/count.asciidoc
@@ -1,13 +1,7 @@
 [[ml-count-functions]]
 === Count Functions

-The {xpackml} features include the following count functions:
-
-* `count`, `high_count`, `low_count`
-* `non_zero_count`, `high_non_zero_count`, `low_non_zero_count`
-* `distinct_count`, `high_distinct_count`, `low_distinct_count`
-
-Count functions detect anomalies when the count of events in a bucket is
+Count functions detect anomalies when the number of events in a bucket is
 anomalous.

 Use `non_zero_count` functions if your data is sparse and you want to ignore
@@ -19,111 +13,202 @@ in one field is unusual, as opposed to the total count.
 Use high-sided functions if you want to monitor unusually high event rates.
 Use low-sided functions if you want to look at drops in event rate.

+The {xpackml} features include the following count functions:

-////
-* <<ml-count>>
-* <<ml-high-count>>
-* <<ml-low-count>>
-* <<ml-nonzero-count>>
-* <<ml-high-nonzero-count>>
-* <<ml-low-nonzero-count>>
+* xref:ml-count[`count`, `high_count`, `low_count`]
+* xref:ml-nonzero-count[`non_zero_count`, `high_non_zero_count`, `low_non_zero_count`]
+* xref:ml-distinct-count[`distinct_count`, `high_distinct_count`, `low_distinct_count`]

 [float]
 [[ml-count]]
-===== Count
+===== Count, High_count, Low_count

-The `count` function detects anomalies when the count of events in a bucket is
+The `count` function detects anomalies when the number of events in a bucket is
 anomalous.

-* field_name: not applicable
-* by_field_name: optional
-* over_field_name: optional
+The `high_count` function detects anomalies when the count of events in a
+bucket is unusually high.

+The `low_count` function detects anomalies when the count of events in a
+bucket is unusually low.
+
+These functions support the following properties:
+
+* `by_field_name` (optional)
+* `over_field_name` (optional)
+* `partition_field_name` (optional)
+
+For more information about those properties,
+see <<ml-detectorconfig,Detector Configuration Objects>>.
+
+.Example 1: Analyzing events with the count function
 [source,js]
 --------------------------------------------------
 { "function" : "count" }
 --------------------------------------------------

-This example is probably the simplest possible analysis! It identifies time
-buckets during which the overall count of events is higher or lower than usual.
+This example is probably the simplest possible analysis. It identifies
+time buckets during which the overall count of events is higher or lower than
+usual.

-It models the event rate and detects when the event rate is unusual compared to
-the past.
-
-[float]
-[[ml-high-count]]
-===== High_count
-
-The `high_count` function detects anomalies when the count of events in a
-bucket are unusually high.
-
-* field_name: not applicable
-* by_field_name: optional
-* over_field_name: optional
+When you use this function in a detector in your job, it models the event rate
+and detects when the event rate is unusual compared to its past behavior.

+.Example 2: Analyzing errors with the high_count function
 [source,js]
 --------------------------------------------------
-{ "function" : "high_count", "byFieldName" : "error_code", "overFieldName": "user" }
+{
+  "function" : "high_count",
+  "by_field_name" : "error_code",
+  "over_field_name": "user"
+}
 --------------------------------------------------

-This example models the event rate for each error code. It detects users that
-generate an unusually high count of error codes compared to other users.
+If you use this `high_count` function in a detector in your job, it
+models the event rate for each error code. It detects users that generate an
+unusually high count of error codes compared to other users.

-[float]
-[[ml-low-count]]
-===== Low_count
-
-The `low_count` function detects anomalies when the count of events in a
-bucket are unusually low.
-
-* field_name: not applicable
-* by_field_name: optional
-* over_field_name: optional

+.Example 3: Analyzing status codes with the low_count function
 [source,js]
 --------------------------------------------------
-{ "function" : "low_count", "byFieldName" : "status_code" }
+{
+  "function" : "low_count",
+  "by_field_name" : "status_code"
+}
 --------------------------------------------------

-In this example, there is a data stream that contains a field “status”. The
-function detects when the count of events for a given status code is lower than
-usual. It models the event rate for each status code and detects when a status
-code has an unusually low count compared to its past behavior.
+In this example, the function detects when the count of events for a
+status code is lower than usual.

-If the data stream consists of web server access log records, for example,
-a drop in the count of events for a particular status code might be an indication
-that something isn’t working correctly.
+When you use this function in a detector in your job, it models the event rate
+for each status code and detects when a status code has an unusually low count
+compared to its past behavior.
+
+.Example 4: Analyzing aggregated data with the count function
+[source,js]
+--------------------------------------------------
+{
+  "summary_count_field_name" : "events_per_min",
+  "detectors" : [
+    { "function" : "count" }
+  ]
+}
+--------------------------------------------------
+
+If you are analyzing an aggregated `events_per_min` field, do not use a sum
+function (for example, `sum(events_per_min)`). Instead, use the count function
+and the `summary_count_field_name` property.
+//TO-DO: For more information, see <<aggregations.asciidoc>>.

 [float]
 [[ml-nonzero-count]]
-===== Non_zero_count
+===== Non_zero_count, High_non_zero_count, Low_non_zero_count
+
+The `non_zero_count` function detects anomalies when the number of events in a
+bucket is anomalous, but it ignores cases where the bucket count is zero. Use
+this function if you know your data is sparse or has gaps and the gaps are not
+important.
+
+The `high_non_zero_count` function detects anomalies when the number of events
+in a bucket is unusually high and it ignores cases where the bucket count is
+zero.
+
+The `low_non_zero_count` function detects anomalies when the number of events in
+a bucket is unusually low and it ignores cases where the bucket count is zero.
+
+These functions support the following properties:
+
+* `by_field_name` (optional)
+* `partition_field_name` (optional)
+
+For more information about those properties,
+see <<ml-detectorconfig,Detector Configuration Objects>>.
+
+For example, if you have the following number of events per bucket:
+
+========================================
+
+1,22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,43,31,0,0,0,0,0,0,0,0,0,0,0,0,2,1
+
+========================================
+
+The `non_zero_count` function models only the following data:
+
+========================================
+
+1,22,2,43,31,2,1
+
+========================================
+
+.Example 5: Analyzing signatures with the high_non_zero_count function
+[source,js]
+--------------------------------------------------
+{
+  "function" : "high_non_zero_count",
+  "by_field_name" : "signaturename"
+}
+--------------------------------------------------
+
+If you use this `high_non_zero_count` function in a detector in your job, it
+models the count of events for the `signaturename` field. It ignores any buckets
+where the count is zero and detects when a `signaturename` value has an
+unusually high count of events compared to its past behavior.
+
+NOTE: Population analysis (using an `over_field_name` property value) is not
+supported for the `non_zero_count`, `high_non_zero_count`, and
+`low_non_zero_count` functions. If you want to do population analysis and your
+data is sparse, use the `count` functions, which are optimized for that scenario.

-non_zero_count:: count, but zeros are treated as null and ignored

 [float]
-[[ml-high-nonzero-count]]
-===== High_non_zero_count
+[[ml-distinct-count]]
+===== Distinct_count, High_distinct_count, Low_distinct_count

-high_non_zero_count::: count, but zeros are treated as null and ignored
+The `distinct_count` function detects anomalies where the number of distinct
+values in one field is unusual.

-[float]
-[[ml-low-nonzero-count]]
-===== Low_non_zero_count
+The `high_distinct_count` function detects unusually high numbers of distinct
+values in one field.

-low_non_zero_count::: count, but zeros are treated as null and ignored
+The `low_distinct_count` function detects unusually low numbers of distinct
+values in one field.

-[float]
-[[ml-low-count]]
-===== Low_count
-distinct_count:: distinct count
+These functions support the following properties:

-[float]
-[[ml-low-count]]
-===== Low_count
-high_distinct_count::: distinct count
+* `field_name` (required)
+* `by_field_name` (optional)
+* `over_field_name` (optional)
+* `partition_field_name` (optional)

-[float]
-[[ml-low-count]]
-===== Low_count
-low_distinct_count::: distinct count
-////
+For more information about those properties,
+see <<ml-detectorconfig,Detector Configuration Objects>>.
+
+.Example 6: Analyzing users with the distinct_count function
+[source,js]
+--------------------------------------------------
+{
+  "function" : "distinct_count",
+  "field_name" : "user"
+}
+--------------------------------------------------
+
+This `distinct_count` function detects when a system has an unusual number
+of logged-in users. When you use this function in a detector in your job, it
+models the distinct count of users. It also detects when the distinct number of
+users is unusual compared to the past.
+
+.Example 7: Analyzing ports with the high_distinct_count function
+[source,js]
+--------------------------------------------------
+{
+  "function" : "high_distinct_count",
+  "field_name" : "dst_port",
+  "over_field_name": "src_ip"
+}
+--------------------------------------------------
+
+This example detects instances of port scanning. When you use this function in a
+detector in your job, it models the distinct count of ports. It also detects the
+`src_ip` values that connect to an unusually high number of different
+`dst_port` values compared to other `src_ip` values.
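
The population-style distinct count in Example 7 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the {ml} model: the helper names `distinct_ports_by_source` and `unusually_high` are hypothetical, the records are invented, and the "more than `factor` times the median" rule is a crude stand-in for the statistical comparison against other `src_ip` values.

```python
from collections import defaultdict
from statistics import median


def distinct_ports_by_source(records):
    """Map each src_ip to the number of distinct dst_port values it contacted."""
    ports = defaultdict(set)
    for src_ip, dst_port in records:
        ports[src_ip].add(dst_port)
    return {ip: len(seen) for ip, seen in ports.items()}


def unusually_high(counts, factor=5):
    """Flag sources whose distinct count far exceeds the population median.

    The fixed factor is an assumption made for this sketch only.
    """
    typical = median(counts.values())
    return sorted(ip for ip, n in counts.items() if n > factor * typical)


# Hypothetical records for a single bucket: (src_ip, dst_port).
records = [
    ("10.0.0.1", 80), ("10.0.0.1", 443),
    ("10.0.0.2", 80), ("10.0.0.2", 80),
    ("10.0.0.3", 22),
] + [("10.0.0.9", port) for port in range(1, 51)]  # one source probing 50 ports

counts = distinct_ports_by_source(records)
flagged = unusually_high(counts)
print(counts)
print(flagged)
```

Repeated connections to the same port do not raise the distinct count, so only the source that touches many different ports stands out from its peers.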