[7.x][DOCS] Augments ML shared definitions (#50487)

Lisa Cawley 2019-12-24 10:22:05 -08:00 committed by GitHub
parent f57569bf5c
commit d479e0563a
1 changed file with 227 additions and 169 deletions


@ -1,3 +1,10 @@
tag::aggregations[]
If set, the {dfeed} performs aggregation searches. Support for aggregations is
limited and they should only be used with low cardinality data. For more information,
see
{ml-docs}/ml-configuring-aggregation.html[Aggregating data for faster performance].
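For illustration only, a {dfeed} aggregation object might look like the
following sketch; the `time` and `bytes` field names are placeholders for
fields in your own data:

[source,js]
----
"aggregations": {
  "buckets": {
    "date_histogram": {
      "field": "time",
      "fixed_interval": "300s"
    },
    "aggregations": {
      "time": {
        "max": { "field": "time" } <1>
      },
      "bytes_sum": {
        "sum": { "field": "bytes" }
      }
    }
  }
}
----
// NOTCONSOLE
<1> The nested `max` aggregation on the time field lets the {dfeed} associate
each aggregated bucket with a timestamp; `time` and `bytes` are placeholder
field names.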
end::aggregations[]
tag::allow-lazy-open[]
Advanced configuration option. Specifies whether this job can open when there is
insufficient {ml} node capacity for it to be immediately assigned to a node. The
@ -21,6 +28,21 @@ subject to the cluster-wide `xpack.ml.max_lazy_ml_nodes` setting - see
`starting` state until sufficient {ml} node capacity is available.
end::allow-lazy-start[]
tag::allow-no-datafeeds[]
Specifies what to do when the request:
+
--
* Contains wildcard expressions and there are no {dfeeds} that match.
* Contains the `_all` string or no identifiers and there are no matches.
* Contains wildcard expressions and there are only partial matches.
The default value is `true`, which returns an empty `datafeeds` array when
there are no matches and the subset of results when there are partial matches.
If this parameter is `false`, the request returns a `404` status code when there
are no matches or only partial matches.
--
end::allow-no-datafeeds[]
tag::allow-no-jobs[]
Specifies what to do when the request:
+
@ -57,71 +79,16 @@ example: `outlier_detection`. See <<ml-dfa-analysis-objects>>.
end::analysis[]
tag::analysis-config[]
The analysis configuration, which specifies how to analyze the data.
After you create a job, you cannot change the analysis configuration; all
the properties are informational. An analysis configuration object has the
following properties:
`bucket_span`:::
(<<time-units,time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=bucket-span]
`categorization_field_name`:::
(string)
include::{docdir}/ml/ml-shared.asciidoc[tag=categorization-field-name]
`categorization_filters`:::
(array of strings)
include::{docdir}/ml/ml-shared.asciidoc[tag=categorization-filters]
`categorization_analyzer`:::
(object or string)
include::{docdir}/ml/ml-shared.asciidoc[tag=categorization-analyzer]
`detectors`:::
(array) An array of detector configuration objects. Detector configuration
objects specify which data fields a job analyzes. They also specify which
analytical functions are used. You can specify multiple detectors for a job.
include::{docdir}/ml/ml-shared.asciidoc[tag=detector]
+
--
NOTE: If the `detectors` array does not contain at least one detector,
no analysis can occur and an error is returned.
--
`influencers`:::
(array of strings)
include::{docdir}/ml/ml-shared.asciidoc[tag=influencers]
`latency`:::
(time units)
include::{docdir}/ml/ml-shared.asciidoc[tag=latency]
`multivariate_by_fields`:::
(boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=multivariate-by-fields]
`summary_count_field_name`:::
(string)
include::{docdir}/ml/ml-shared.asciidoc[tag=summary-count-field-name]
The analysis configuration, which specifies how to analyze the data. After you
create a job, you cannot change the analysis configuration; all the properties
are informational.
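For example, a minimal analysis configuration might look like the following
sketch, where `responsetime` and `airline` are placeholder field names:

[source,js]
----
"analysis_config": {
  "bucket_span": "15m",
  "detectors": [
    {
      "detector_description": "Mean response time per airline",
      "function": "mean",
      "field_name": "responsetime", <1>
      "by_field_name": "airline"
    }
  ],
  "influencers": [ "airline" ]
}
----
// NOTCONSOLE
<1> The `responsetime` and `airline` fields are examples only; substitute
fields from your own data.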
end::analysis-config[]
tag::analysis-limits[]
Limits can be applied for the resources required to hold the mathematical models
in memory. These limits are approximate and can be set per job. They do not
control the memory used by other processes, for example the {es} Java
processes. If necessary, you can increase the limits after the job is created.
The `analysis_limits` object has the following properties:
`categorization_examples_limit`:::
(long)
include::{docdir}/ml/ml-shared.asciidoc[tag=categorization-examples-limit]
`model_memory_limit`:::
(long or string)
include::{docdir}/ml/ml-shared.asciidoc[tag=model-memory-limit]
control the memory used by other processes, for example the {es} Java processes.
If necessary, you can increase the limits after the job is created.
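For illustration, an `analysis_limits` object might look like the following
sketch; the values shown are examples rather than recommendations:

[source,js]
----
"analysis_limits": {
  "categorization_examples_limit": 4,
  "model_memory_limit": "512mb"
}
----
// NOTCONSOLE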
end::analysis-limits[]
tag::analyzed-fields[]
@ -142,7 +109,6 @@ see the <<explain-dfanalytics>> which helps understand field selection.
automatically.
end::analyzed-fields[]
tag::background-persist-interval[]
Advanced configuration option. The time between each periodic persistence of the
model. The default value is a randomized value between 3 and 4 hours, which
@ -162,6 +128,11 @@ The size of the interval that the analysis is aggregated into, typically between
see <<time-units>>.
end::bucket-span[]
tag::bucket-span-results[]
The length of the bucket in seconds. This value matches the `bucket_span`
that is specified in the job.
end::bucket-span-results[]
tag::by-field-name[]
The field used to split the data. In particular, this property is used for
analyzing the splits with respect to their own history. It is used for finding
@ -184,15 +155,15 @@ object. If it is a string it must refer to a
is an object it has the following properties:
--
`char_filter`::::
`analysis_config`.`categorization_analyzer`.`char_filter`::::
(array of strings or objects)
include::{docdir}/ml/ml-shared.asciidoc[tag=char-filter]
`tokenizer`::::
`analysis_config`.`categorization_analyzer`.`tokenizer`::::
(string or object)
include::{docdir}/ml/ml-shared.asciidoc[tag=tokenizer]
`filter`::::
`analysis_config`.`categorization_analyzer`.`filter`::::
(array of strings or objects)
include::{docdir}/ml/ml-shared.asciidoc[tag=filter]
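As a sketch only, a custom categorization analyzer might combine these
properties as follows; the `pattern_replace` pattern and the stop words are
placeholders that you would tailor to your own log format:

[source,js]
----
"categorization_analyzer": {
  "char_filter": [
    { "type": "pattern_replace", "pattern": "\\[.*?\\]" } <1>
  ],
  "tokenizer": "ml_classic",
  "filter": [
    { "type": "stop", "stopwords": [ "DEBUG", "INFO", "WARN", "ERROR" ] }
  ]
}
----
// NOTCONSOLE
<1> The pattern and the stop word list are illustrative; choose values that
strip the variable parts of your own messages.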
end::categorization-analyzer[]
@ -246,6 +217,22 @@ add them here as
<<analysis-pattern-replace-charfilter,pattern replace character filters>>.
end::char-filter[]
tag::chunking-config[]
{dfeeds-cap} might be required to search over long time periods, for several
months or years. This search is split into time chunks in order to ensure the
load on {es} is managed. Chunking configuration controls how the size of these
time chunks is calculated and is an advanced configuration option.
A chunking configuration object has the following properties:
`chunking_config`.`mode`:::
(string)
include::{docdir}/ml/ml-shared.asciidoc[tag=mode]
`chunking_config`.`time_span`:::
(<<time-units,time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=time-span]
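For illustration, a manual chunking configuration might look like the
following sketch:

[source,js]
----
"chunking_config": {
  "mode": "manual",
  "time_span": "3h"
}
----
// NOTCONSOLE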
end::chunking-config[]
tag::compute-feature-influence[]
If `true`, the feature influence calculation is enabled. Defaults to `true`.
end::compute-feature-influence[]
@ -255,11 +242,10 @@ An array of custom rule objects, which enable you to customize the way detectors
operate. For example, a rule may dictate the conditions under which the
detector skips results. For more examples, see
{ml-docs}/ml-configuring-detector-custom-rules.html[Customizing detectors with custom rules].
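For example, the following sketch shows a custom rule that skips results when
the actual value is below an illustrative threshold of `10`:

[source,js]
----
"custom_rules": [
  {
    "actions": [ "skip_result" ],
    "conditions": [
      {
        "applies_to": "actual",
        "operator": "lt",
        "value": 10 <1>
      }
    ]
  }
]
----
// NOTCONSOLE
<1> The threshold of `10` is a placeholder; a real rule would use a value that
is meaningful for your data.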
A custom rule has the following properties:
+
--
`actions`::
(array) The set of actions to be triggered when the rule applies. If
end::custom-rules[]
tag::custom-rules-actions[]
The set of actions to be triggered when the rule applies. If more than one
action is specified, the effects of all actions are combined. The available
actions include:
@ -271,49 +257,47 @@ model. Unless you also specify `skip_result`, the results will be created as
usual. This action is suitable when certain values are expected to be
consistently anomalous and they affect the model in a way that negatively
impacts the rest of the results.
end::custom-rules-actions[]
`scope`::
(object) An optional scope of series where the rule applies. A rule must either
tag::custom-rules-scope[]
An optional scope of series where the rule applies. A rule must either
have a non-empty scope or at least one condition. By default, the scope includes
all series. Scoping is allowed for any of the fields that are also specified in
`by_field_name`, `over_field_name`, or `partition_field_name`. To add a scope
for a field, add the field name as a key in the scope object and set its value
to an object with the following properties:
end::custom-rules-scope[]
`filter_id`:::
(string) The id of the filter to be used.
tag::custom-rules-scope-filter-id[]
The id of the filter to be used.
end::custom-rules-scope-filter-id[]
`filter_type`:::
(string) Either `include` (the rule applies for values in the filter) or
`exclude` (the rule applies for values not in the filter). Defaults to
`include`.
tag::custom-rules-scope-filter-type[]
Either `include` (the rule applies for values in the filter) or `exclude` (the
rule applies for values not in the filter). Defaults to `include`.
end::custom-rules-scope-filter-type[]
`conditions`::
(array) An optional array of numeric conditions when the rule applies. A rule
must either have a non-empty scope or at least one condition. Multiple
conditions are combined together with a logical `AND`. A condition has the
following properties:
tag::custom-rules-conditions[]
An optional array of numeric conditions when the rule applies. A rule must
either have a non-empty scope or at least one condition. Multiple conditions are
combined together with a logical `AND`. A condition has the following properties:
end::custom-rules-conditions[]
`applies_to`:::
(string) Specifies the result property to which the condition applies. The
available options are `actual`, `typical`, `diff_from_typical`, `time`.
tag::custom-rules-conditions-applies-to[]
Specifies the result property to which the condition applies. The available
options are `actual`, `typical`, `diff_from_typical`, `time`. If your detector
uses `lat_long`, `metric`, `rare`, or `freq_rare` functions, you can only
specify conditions that apply to `time`.
end::custom-rules-conditions-applies-to[]
`operator`:::
(string) Specifies the condition operator. The available options are `gt`
(greater than), `gte` (greater than or equals), `lt` (less than) and `lte` (less
than or equals).
tag::custom-rules-conditions-operator[]
Specifies the condition operator. The available options are `gt` (greater than),
`gte` (greater than or equals), `lt` (less than) and `lte` (less than or equals).
end::custom-rules-conditions-operator[]
`value`:::
(double) The value that is compared against the `applies_to` field using the
`operator`.
--
+
--
NOTE: If your detector uses `lat_long`, `metric`, `rare`, or `freq_rare`
functions, you can only specify `conditions` that apply to `time`.
--
end::custom-rules[]
tag::custom-rules-conditions-value[]
The value that is compared against the `applies_to` field using the `operator`.
end::custom-rules-conditions-value[]
tag::custom-settings[]
Advanced configuration option. Contains custom meta data about the job. For
@ -330,16 +314,14 @@ a {dfeed}, these properties are automatically set.
When data is received via the <<ml-post-data,post data>> API, it is not stored
in {es}. Only the results for {anomaly-detect} are retained.
A data description object has the following properties:
`format`:::
`data_description`.`format`:::
(string) Only `JSON` format is supported at this time.
`time_field`:::
`data_description`.`time_field`:::
(string) The name of the field that contains the timestamp.
The default value is `time`.
`time_format`:::
`data_description`.`time_format`:::
(string)
include::{docdir}/ml/ml-shared.asciidoc[tag=time-format]
--
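For illustration, a data description for documents that carry an `epoch_ms`
timestamp in a placeholder field named `timestamp` might look like the
following sketch:

[source,js]
----
"data_description": {
  "time_field": "timestamp",
  "time_format": "epoch_ms"
}
----
// NOTCONSOLE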
@ -444,8 +426,8 @@ expression.
end::datafeed-id-wildcard[]
tag::decompress-definition[]
Specifies whether the included model definition should be returned as a JSON map (`true`) or
in a custom compressed format (`false`). Defaults to `true`.
Specifies whether the included model definition should be returned as a JSON map
(`true`) or in a custom compressed format (`false`). Defaults to `true`.
end::decompress-definition[]
tag::delayed-data-check-config[]
@ -462,13 +444,11 @@ moment in time. See
This check runs only on real-time {dfeeds}.
The configuration object has the following properties:
`enabled`::
`delayed_data_check_config`.`enabled`::
(boolean) Specifies whether the {dfeed} periodically checks for delayed data.
Defaults to `true`.
`check_window`::
`delayed_data_check_config`.`check_window`::
(<<time-units,time units>>) The window of time that is searched for late data.
This window of time ends with the latest finalized bucket. It defaults to
`null`, which causes an appropriate `check_window` to be calculated when the
@ -485,6 +465,10 @@ that document will not be used for training, but a prediction with the trained
model will be generated for it. It is also known as a continuous target
variable.
end::dependent-variable[]
tag::desc-results[]
If `true`, the results are sorted in descending order.
end::desc-results[]
tag::description-dfa[]
A description of the job.
end::description-dfa[]
@ -502,26 +486,6 @@ optionally `results_field` (`ml` by default).
results of the analysis. Defaults to `ml`.
end::dest[]
tag::detector-description[]
A description of the detector. For example, `Low event rate`.
end::detector-description[]
tag::detector-field-name[]
The field that the detector uses in the function. If you use an event rate
function such as `count` or `rare`, do not specify this field.
+
--
NOTE: The `field_name` cannot contain double quotes or backslashes.
--
end::detector-field-name[]
tag::detector-index[]
A unique identifier for the detector. This identifier is based on the order of
the detectors in the `analysis_config`, starting at zero. You can use this
identifier when you want to update a specific detector.
end::detector-index[]
tag::detector[]
A detector has the following properties:
@ -567,6 +531,26 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=use-null]
end::detector[]
tag::detector-description[]
A description of the detector. For example, `Low event rate`.
end::detector-description[]
tag::detector-field-name[]
The field that the detector uses in the function. If you use an event rate
function such as `count` or `rare`, do not specify this field.
+
--
NOTE: The `field_name` cannot contain double quotes or backslashes.
--
end::detector-field-name[]
tag::detector-index[]
A unique identifier for the detector. This identifier is based on the order of
the detectors in the `analysis_config`, starting at zero. You can use this
identifier when you want to update a specific detector.
end::detector-index[]
tag::eta[]
The shrinkage applied to the weights. Smaller values result
in larger forests which have better generalization error. However, the smaller
@ -583,6 +567,11 @@ working with both over and by fields, then you can set `exclude_frequent` to
`all` for both fields, or to `by` or `over` for those specific fields.
end::exclude-frequent[]
tag::exclude-interim-results[]
If `true`, the output excludes interim results. By default, interim results are
included.
end::exclude-interim-results[]
tag::feature-bag-fraction[]
Defines the fraction of features that will be used when
selecting a random bag for each candidate split.
@ -624,6 +613,13 @@ optional. If it is not specified, no token filters are applied prior to
categorization.
end::filter[]
tag::frequency[]
The interval at which scheduled queries are made while the {dfeed} runs in real
time. The default value is either the bucket span for short bucket spans, or,
for longer bucket spans, a sensible fraction of the bucket span. For example:
`150s`.
end::frequency[]
tag::from[]
Skips the specified number of {dfanalytics-jobs}. The default value is `0`.
end::from[]
@ -671,24 +667,26 @@ is available as part of the input data. When you use multiple detectors, the use
of influencers is recommended as it aggregates results for each influencer entity.
end::influencers[]
tag::is-interim[]
If `true`, this is an interim result. In other words, the results are calculated
based on partial input data.
end::is-interim[]
tag::job-id-anomaly-detection[]
Identifier for the {anomaly-job}.
end::job-id-anomaly-detection[]
tag::job-id-data-frame-analytics[]
Identifier for the {dfanalytics-job}.
end::job-id-data-frame-analytics[]
tag::job-id-anomaly-detection-default[]
Identifier for the {anomaly-job}. It can be a job identifier, a group name, or a
wildcard expression. If you do not specify one of these options, the API returns
information for all {anomaly-jobs}.
end::job-id-anomaly-detection-default[]
tag::job-id-data-frame-analytics-default[]
Identifier for the {dfanalytics-job}. If you do not specify this option, the API
returns information for the first hundred {dfanalytics-jobs}.
end::job-id-data-frame-analytics-default[]
tag::job-id-anomaly-detection-define[]
Identifier for the {anomaly-job}. This identifier can contain lowercase
alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start
and end with alphanumeric characters.
end::job-id-anomaly-detection-define[]
tag::job-id-anomaly-detection-list[]
An identifier for the {anomaly-jobs}. It can be a job
@ -705,11 +703,14 @@ Identifier for the {anomaly-job}. It can be a job identifier, a group name, a
comma-separated list of jobs or groups, or a wildcard expression.
end::job-id-anomaly-detection-wildcard-list[]
tag::job-id-anomaly-detection-define[]
Identifier for the {anomaly-job}. This identifier can contain lowercase
alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start
and end with alphanumeric characters.
end::job-id-anomaly-detection-define[]
tag::job-id-data-frame-analytics[]
Identifier for the {dfanalytics-job}.
end::job-id-data-frame-analytics[]
tag::job-id-data-frame-analytics-default[]
Identifier for the {dfanalytics-job}. If you do not specify this option, the API
returns information for the first hundred {dfanalytics-jobs}.
end::job-id-data-frame-analytics-default[]
tag::job-id-data-frame-analytics-define[]
Identifier for the {dfanalytics-job}. This identifier can contain lowercase
@ -717,6 +718,10 @@ alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start
and end with alphanumeric characters.
end::job-id-data-frame-analytics-define[]
tag::job-id-datafeed[]
The unique identifier for the job to which the {dfeed} sends data.
end::job-id-datafeed[]
tag::jobs-stats-anomaly-detection[]
An array of {anomaly-job} statistics objects.
For more information, see <<ml-jobstats>>.
@ -745,6 +750,15 @@ the <<ml-post-data,post data>> API.
--
end::latency[]
tag::max-empty-searches[]
If a real-time {dfeed} has never seen any data (including during any initial
training period), it automatically stops and closes its associated job after
this many real-time searches that return no documents. In other words, it stops
after `frequency` times `max_empty_searches` of real-time operation. For
example, with a `frequency` of `150s` and a `max_empty_searches` of `10`, the
{dfeed} stops after 25 minutes of empty searches. If this setting is not
configured, a {dfeed} with no end time that sees no data remains started until
it is explicitly stopped. By default, this setting is not set.
end::max-empty-searches[]
tag::maximum-number-trees[]
Defines the maximum number of trees the forest is allowed
to contain. The maximum value is 2000.
@ -837,26 +851,24 @@ be seen in the model plot.
Model plot config can be configured when the job is created or updated later. It
must be disabled if performance issues are experienced.
The `model_plot_config` object has the following properties:
`enabled`:::
(boolean) If true, enables calculation and storage of the model bounds for
each entity that is being analyzed. By default, this is not enabled.
`terms`:::
experimental[] (string) Limits data collection to this comma separated list of
partition or by field values. If terms are not specified or it is an empty
string, no filtering is applied. For example, "CPU,NetworkIn,DiskWrites".
Wildcards are not supported. Only the specified `terms` can be viewed when
using the Single Metric Viewer.
--
end::model-plot-config[]
tag::model-plot-config-enabled[]
If true, enables calculation and storage of the model bounds for each entity
that is being analyzed. By default, this is not enabled.
end::model-plot-config-enabled[]
tag::model-plot-config-terms[]
Limits data collection to this comma separated list of partition or by field
values. If terms are not specified or it is an empty string, no filtering is
applied. For example, "CPU,NetworkIn,DiskWrites". Wildcards are not supported.
Only the specified `terms` can be viewed when using the Single Metric Viewer.
end::model-plot-config-terms[]
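Taken together, a model plot configuration that enables model bounds and
limits data collection to two of the metrics named above might look like the
following sketch:

[source,js]
----
"model_plot_config": {
  "enabled": true,
  "terms": "CPU,NetworkIn"
}
----
// NOTCONSOLE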
tag::model-snapshot-id[]
A numerical character string that uniquely identifies the model snapshot. For
example, `1491007364`. For more information about model snapshots, see
<<ml-snapshot-resource>>.
example, `1575402236000`.
end::model-snapshot-id[]
tag::model-snapshot-retention-days[]
@ -925,6 +937,21 @@ Defines the name of the prediction field in the results.
Defaults to `<dependent_variable>_prediction`.
end::prediction-field-name[]
tag::query[]
The {es} query domain-specific language (DSL). This value corresponds to the
query object in an {es} search POST body. All the options that are supported by
{es} can be used, as this object is passed verbatim to {es}. By default, this
property has the following value: `{"match_all": {"boost": 1}}`.
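For example, a {dfeed} query that narrows the search with a `term` filter
might look like the following sketch; the `event.dataset` field and its value
are placeholders:

[source,js]
----
"query": {
  "bool": {
    "filter": [
      { "term": { "event.dataset": "apache.access" } } <1>
    ]
  }
}
----
// NOTCONSOLE
<1> The `event.dataset` field and the `apache.access` value are examples only.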
end::query[]
tag::query-delay[]
The number of seconds behind real time that data is queried. For example, if
data from 10:04 a.m. might not be searchable in {es} until 10:06 a.m., set this
property to 120 seconds. The default value is randomly selected between `60s`
and `120s`. This randomness improves the query performance when there are
multiple jobs running on the same node.
end::query-delay[]
tag::randomize-seed[]
Defines the seed for the random generator that is used to pick which documents
are used for training. By default, it is randomly generated.
@ -951,11 +978,33 @@ are deleted from {es}. The default value is null, which means results are
retained.
end::results-retention-days[]
tag::retain[]
If `true`, this snapshot will not be deleted during automatic cleanup of
snapshots older than `model_snapshot_retention_days`. However, this snapshot
will be deleted when the job is deleted. The default value is `false`.
end::retain[]
tag::script-fields[]
Specifies scripts that evaluate custom expressions and return script fields to
the {dfeed}. The detector configuration objects in a job can contain functions
that use these script fields. For more information, see
{ml-docs}/ml-configuring-transform.html[Transforming data with script fields]
and <<request-body-search-script-fields,Script fields>>.
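For illustration, the following sketch defines a script field that sums two
placeholder numeric fields; the `total_error_count` name and the source fields
are examples only:

[source,js]
----
"script_fields": {
  "total_error_count": {
    "script": {
      "lang": "painless",
      "source": "doc['error_count'].value + doc['aborted_count'].value" <1>
    }
  }
}
----
// NOTCONSOLE
<1> The `error_count` and `aborted_count` fields are placeholders for numeric
fields in your own data.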
end::script-fields[]
tag::scroll-size[]
The `size` parameter that is used in {es} searches. The default value is `1000`.
end::scroll-size[]
tag::size[]
Specifies the maximum number of {dfanalytics-jobs} to obtain. The default value
is `100`.
end::size[]
tag::snapshot-id[]
Identifier for the model snapshot.
end::snapshot-id[]
tag::source-put-dfa[]
The configuration of how to source the analysis data. It requires an
`index`. Optionally, `query` and `_source` may be specified.
@ -1006,16 +1055,6 @@ function.
--
end::summary-count-field-name[]
tag::timeout-start[]
Controls the amount of time to wait until the {dfanalytics-job} starts. Defaults
to 20 seconds.
end::timeout-start[]
tag::timeout-stop[]
Controls the amount of time to wait until the {dfanalytics-job} stops. Defaults
to 20 seconds.
end::timeout-stop[]
tag::time-format[]
The time format, which can be `epoch`, `epoch_ms`, or a custom pattern. The
default value is `epoch`, which refers to UNIX or Epoch time (the number of
@ -1033,6 +1072,25 @@ timestamp, job creation fails.
--
end::time-format[]
tag::time-span[]
The time span that each search queries. This setting is only applicable when
the mode is set to `manual`. For example: `3h`.
end::time-span[]
tag::timeout-start[]
Controls the amount of time to wait until the {dfanalytics-job} starts. Defaults
to 20 seconds.
end::timeout-start[]
tag::timeout-stop[]
Controls the amount of time to wait until the {dfanalytics-job} stops. Defaults
to 20 seconds.
end::timeout-stop[]
tag::timestamp-results[]
The start time of the bucket for which these results were calculated.
end::timestamp-results[]
tag::tokenizer[]
The name or definition of the <<analysis-tokenizers,tokenizer>> to use after
character filters are applied. This property is compulsory if