From d479e0563a6794579c9e6c408c6fc6a475f0f58c Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Tue, 24 Dec 2019 10:22:05 -0800 Subject: [PATCH] [7.x][DOCS] Augments ML shared definitions (#50487) --- docs/reference/ml/ml-shared.asciidoc | 396 +++++++++++++++------------ 1 file changed, 227 insertions(+), 169 deletions(-) diff --git a/docs/reference/ml/ml-shared.asciidoc b/docs/reference/ml/ml-shared.asciidoc index 5ff363553d4..52e4d459600 100644 --- a/docs/reference/ml/ml-shared.asciidoc +++ b/docs/reference/ml/ml-shared.asciidoc @@ -1,3 +1,10 @@ +tag::aggregations[] +If set, the {dfeed} performs aggregation searches. Support for aggregations is +limited and should only be used with low cardinality data. For more information, +see +{ml-docs}/ml-configuring-aggregation.html[Aggregating data for faster performance]. +end::aggregations[] + tag::allow-lazy-open[] Advanced configuration option. Specifies whether this job can open when there is insufficient {ml} node capacity for it to be immediately assigned to a node. The @@ -21,6 +28,21 @@ subject to the cluster-wide `xpack.ml.max_lazy_ml_nodes` setting - see `starting` state until sufficient {ml} node capacity is available. end::allow-lazy-start[] +tag::allow-no-datafeeds[] +Specifies what to do when the request: ++ +-- +* Contains wildcard expressions and there are no {dfeeds} that match. +* Contains the `_all` string or no identifiers and there are no matches. +* Contains wildcard expressions and there are only partial matches. + +The default value is `true`, which returns an empty `datafeeds` array when +there are no matches and the subset of results when there are partial matches. +If this parameter is `false`, the request returns a `404` status code when there +are no matches or only partial matches. +-- +end::allow-no-datafeeds[] + tag::allow-no-jobs[] Specifies what to do when the request: + @@ -57,71 +79,16 @@ example: `outlier_detection`. See <>. end::analysis[] tag::analysis-config[] -The analysis configuration, which specifies how to analyze the data. -After you create a job, you cannot change the analysis configuration; all -the properties are informational. An analysis configuration object has the -following properties: - -`bucket_span`::: -(<>) -include::{docdir}/ml/ml-shared.asciidoc[tag=bucket-span] - -`categorization_field_name`::: -(string) -include::{docdir}/ml/ml-shared.asciidoc[tag=categorization-field-name] - -`categorization_filters`::: -(array of strings) -include::{docdir}/ml/ml-shared.asciidoc[tag=categorization-filters] - -`categorization_analyzer`::: -(object or string) -include::{docdir}/ml/ml-shared.asciidoc[tag=categorization-analyzer] - -`detectors`::: -(array) An array of detector configuration objects. Detector configuration -objects specify which data fields a job analyzes. They also specify which -analytical functions are used. You can specify multiple detectors for a job. -include::{docdir}/ml/ml-shared.asciidoc[tag=detector] -+ --- -NOTE: If the `detectors` array does not contain at least one detector, -no analysis can occur and an error is returned. - --- - -`influencers`::: -(array of strings) -include::{docdir}/ml/ml-shared.asciidoc[tag=influencers] - -`latency`::: -(time units) -include::{docdir}/ml/ml-shared.asciidoc[tag=latency] - -`multivariate_by_fields`::: -(boolean) -include::{docdir}/ml/ml-shared.asciidoc[tag=multivariate-by-fields] - -`summary_count_field_name`::: -(string) -include::{docdir}/ml/ml-shared.asciidoc[tag=summary-count-field-name] - +The analysis configuration, which specifies how to analyze the data. After you +create a job, you cannot change the analysis configuration; all the properties +are informational. end::analysis-config[] tag::analysis-limits[] Limits can be applied for the resources required to hold the mathematical models in memory. These limits are approximate and can be set per job. They do not -control the memory used by other processes, for example the {es} Java -processes. If necessary, you can increase the limits after the job is created. -The `analysis_limits` object has the following properties: - -`categorization_examples_limit`::: -(long) -include::{docdir}/ml/ml-shared.asciidoc[tag=categorization-examples-limit] - -`model_memory_limit`::: -(long or string) -include::{docdir}/ml/ml-shared.asciidoc[tag=model-memory-limit] +control the memory used by other processes, for example the {es} Java processes. +If necessary, you can increase the limits after the job is created. end::analysis-limits[] tag::analyzed-fields[] @@ -142,7 +109,6 @@ see the <> which helps understand field selection. automatically. end::analyzed-fields[] - tag::background-persist-interval[] Advanced configuration option. The time between each periodic persistence of the model. The default value is a randomized value between 3 to 4 hours, which @@ -162,6 +128,11 @@ The size of the interval that the analysis is aggregated into, typically between see <>. end::bucket-span[] +tag::bucket-span-results[] +The length of the bucket in seconds. This value matches the `bucket_span` +that is specified in the job. +end::bucket-span-results[] + tag::by-field-name[] The field used to split the data. In particular, this property is used for analyzing the splits with respect to their own history. It is used for finding @@ -184,15 +155,15 @@ object. If it is a string it must refer to a is an object it has the following properties: -- -`char_filter`:::: +`analysis_config`.`categorization_analyzer`.`char_filter`:::: (array of strings or objects) include::{docdir}/ml/ml-shared.asciidoc[tag=char-filter] -`tokenizer`:::: +`analysis_config`.`categorization_analyzer`.`tokenizer`:::: (string or object) include::{docdir}/ml/ml-shared.asciidoc[tag=tokenizer] -`filter`:::: +`analysis_config`.`categorization_analyzer`.`filter`:::: (array of strings or objects) include::{docdir}/ml/ml-shared.asciidoc[tag=filter] end::categorization-analyzer[] @@ -246,6 +217,22 @@ add them here as <>. end::char-filter[] +tag::chunking-config[] +{dfeeds-cap} might be required to search over long time periods, for several months +or years. This search is split into time chunks in order to ensure the load +on {es} is managed. Chunking configuration controls how the size of these time +chunks are calculated and is an advanced configuration option. +A chunking configuration object has the following properties: + +`chunking_config`.`mode`::: +(string) +include::{docdir}/ml/ml-shared.asciidoc[tag=mode] + +`chunking_config`.`time_span`::: +(<>) +include::{docdir}/ml/ml-shared.asciidoc[tag=time-span] +end::chunking-config[] + tag::compute-feature-influence[] If `true`, the feature influence calculation is enabled. Defaults to `true`. end::compute-feature-influence[] @@ -255,11 +242,10 @@ An array of custom rule objects, which enable you to customize the way detectors operate. For example, a rule may dictate to the detector conditions under which results should be skipped. For more examples, see {ml-docs}/ml-configuring-detector-custom-rules.html[Customizing detectors with custom rules]. -A custom rule has the following properties: -+ --- -`actions`:: -(array) The set of actions to be triggered when the rule applies. If +end::custom-rules[] + +tag::custom-rules-actions[] +The set of actions to be triggered when the rule applies. If more than one action is specified the effects of all actions are combined. The available actions include: @@ -271,49 +257,47 @@ model. Unless you also specify `skip_result`, the results will be created as usual. This action is suitable when certain values are expected to be consistently anomalous and they affect the model in a way that negatively impacts the rest of the results. +end::custom-rules-actions[] -`scope`:: -(object) An optional scope of series where the rule applies. A rule must either +tag::custom-rules-scope[] +An optional scope of series where the rule applies. A rule must either have a non-empty scope or at least one condition. By default, the scope includes all series. Scoping is allowed for any of the fields that are also specified in `by_field_name`, `over_field_name`, or `partition_field_name`. To add a scope for a field, add the field name as a key in the scope object and set its value to an object with the following properties: +end::custom-rules-scope[] -`filter_id`::: -(string) The id of the filter to be used. +tag::custom-rules-scope-filter-id[] +The id of the filter to be used. +end::custom-rules-scope-filter-id[] -`filter_type`::: -(string) Either `include` (the rule applies for values in the filter) or -`exclude` (the rule applies for values not in the filter). Defaults to -`include`. +tag::custom-rules-scope-filter-type[] +Either `include` (the rule applies for values in the filter) or `exclude` (the +rule applies for values not in the filter). Defaults to `include`. +end::custom-rules-scope-filter-type[] -`conditions`:: -(array) An optional array of numeric conditions when the rule applies. A rule -must either have a non-empty scope or at least one condition. Multiple -conditions are combined together with a logical `AND`. A condition has the -following properties: +tag::custom-rules-conditions[] +An optional array of numeric conditions when the rule applies. A rule must +either have a non-empty scope or at least one condition. Multiple conditions are +combined together with a logical `AND`. A condition has the following properties: +end::custom-rules-conditions[] -`applies_to`::: -(string) Specifies the result property to which the condition applies. The -available options are `actual`, `typical`, `diff_from_typical`, `time`. +tag::custom-rules-conditions-applies-to[] +Specifies the result property to which the condition applies. The available +options are `actual`, `typical`, `diff_from_typical`, `time`. If your detector +uses `lat_long`, `metric`, `rare`, or `freq_rare` functions, you can only +specify conditions that apply to `time`. +end::custom-rules-conditions-applies-to[] -`operator`::: -(string) Specifies the condition operator. The available options are `gt` -(greater than), `gte` (greater than or equals), `lt` (less than) and `lte` (less -than or equals). +tag::custom-rules-conditions-operator[] +Specifies the condition operator. The available options are `gt` (greater than), +`gte` (greater than or equals), `lt` (less than) and `lte` (less than or equals). +end::custom-rules-conditions-operator[] -`value`::: -(double) The value that is compared against the `applies_to` field using the -`operator`. --- -+ --- -NOTE: If your detector uses `lat_long`, `metric`, `rare`, or `freq_rare` -functions, you can only specify `conditions` that apply to `time`. - --- -end::custom-rules[] +tag::custom-rules-conditions-value[] +The value that is compared against the `applies_to` field using the `operator`. +end::custom-rules-conditions-value[] tag::custom-settings[] Advanced configuration option. Contains custom meta data about the job. For @@ -330,16 +314,14 @@ a {dfeed}, these properties are automatically set. When data is received via the <> API, it is not stored in {es}. Only the results for {anomaly-detect} are retained. -A data description object has the following properties: - -`format`::: +`data_description`.`format`::: (string) Only `JSON` format is supported at this time. -`time_field`::: +`data_description`.`time_field`::: (string) The name of the field that contains the timestamp. The default value is `time`. -`time_format`::: +`data_description`.`time_format`::: (string) include::{docdir}/ml/ml-shared.asciidoc[tag=time-format] -- @@ -444,8 +426,8 @@ expression. end::datafeed-id-wildcard[] tag::decompress-definition[] -Specifies whether the included model definition should be returned as a JSON map (`true`) or -in a custom compressed format (`false`). Defaults to `true`. +Specifies whether the included model definition should be returned as a JSON map +(`true`) or in a custom compressed format (`false`). Defaults to `true`. end::decompress-definition[] tag::delayed-data-check-config[] @@ -462,13 +444,11 @@ moment in time. See This check runs only on real-time {dfeeds}. -The configuration object has the following properties: - -`enabled`:: +`delayed_data_check_config`.`enabled`:: (boolean) Specifies whether the {dfeed} periodically checks for delayed data. Defaults to `true`. -`check_window`:: +`delayed_data_check_config`.`check_window`:: (<>) The window of time that is searched for late data. This window of time ends with the latest finalized bucket. It defaults to `null`, which causes an appropriate `check_window` to be calculated when the @@ -485,6 +465,10 @@ that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. end::dependent-variable[] +tag::desc-results[] +If true, the results are sorted in descending order. +end::desc-results[] + tag::description-dfa[] A description of the job. end::description-dfa[] @@ -502,26 +486,6 @@ optionally `results_field` (`ml` by default). results of the analysis. Default to `ml`. end::dest[] -tag::detector-description[] -A description of the detector. For example, `Low event rate`. -end::detector-description[] - -tag::detector-field-name[] -The field that the detector uses in the function. If you use an event rate -function such as `count` or `rare`, do not specify this field. -+ --- -NOTE: The `field_name` cannot contain double quotes or backslashes. - --- -end::detector-field-name[] - -tag::detector-index[] -A unique identifier for the detector. This identifier is based on the order of -the detectors in the `analysis_config`, starting at zero. You can use this -identifier when you want to update a specific detector. -end::detector-index[] - tag::detector[] A detector has the following properties: @@ -567,6 +531,26 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=use-null] end::detector[] +tag::detector-description[] +A description of the detector. For example, `Low event rate`. +end::detector-description[] + +tag::detector-field-name[] +The field that the detector uses in the function. If you use an event rate +function such as `count` or `rare`, do not specify this field. ++ +-- +NOTE: The `field_name` cannot contain double quotes or backslashes. + +-- +end::detector-field-name[] + +tag::detector-index[] +A unique identifier for the detector. This identifier is based on the order of +the detectors in the `analysis_config`, starting at zero. You can use this +identifier when you want to update a specific detector. +end::detector-index[] + tag::eta[] The shrinkage applied to the weights. Smaller values result in larger forests which have better generalization error. However, the smaller @@ -583,6 +567,11 @@ working with both over and by fields, then you can set `exclude_frequent` to `all` for both fields, or to `by` or `over` for those specific fields. end::exclude-frequent[] +tag::exclude-interim-results[] +If `true`, the output excludes interim results. By default, interim results are +included. +end::exclude-interim-results[] + tag::feature-bag-fraction[] Defines the fraction of features that will be used when selecting a random bag for each candidate split. @@ -624,6 +613,13 @@ optional. If it is not specified, no token filters are applied prior to categorization. end::filter[] +tag::frequency[] +The interval at which scheduled queries are made while the {dfeed} runs in real +time. The default value is either the bucket span for short bucket spans, or, +for longer bucket spans, a sensible fraction of the bucket span. For example: +`150s`. +end::frequency[] + tag::from[] Skips the specified number of {dfanalytics-jobs}. The default value is `0`. end::from[] @@ -671,24 +667,26 @@ is available as part of the input data. When you use multiple detectors, the use of influencers is recommended as it aggregates results for each influencer entity. end::influencers[] +tag::is-interim[] +If `true`, this is an interim result. In other words, the results are calculated +based on partial input data. +end::is-interim[] + tag::job-id-anomaly-detection[] Identifier for the {anomaly-job}. end::job-id-anomaly-detection[] -tag::job-id-data-frame-analytics[] -Identifier for the {dfanalytics-job}. -end::job-id-data-frame-analytics[] - tag::job-id-anomaly-detection-default[] Identifier for the {anomaly-job}. It can be a job identifier, a group name, or a wildcard expression. If you do not specify one of these options, the API returns information for all {anomaly-jobs}. end::job-id-anomaly-detection-default[] -tag::job-id-data-frame-analytics-default[] -Identifier for the {dfanalytics-job}. If you do not specify this option, the API -returns information for the first hundred {dfanalytics-jobs}. -end::job-id-data-frame-analytics-default[] +tag::job-id-anomaly-detection-define[] +Identifier for the {anomaly-job}. This identifier can contain lowercase +alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start +and end with alphanumeric characters. +end::job-id-anomaly-detection-define[] tag::job-id-anomaly-detection-list[] An identifier for the {anomaly-jobs}. It can be a job @@ -705,11 +703,14 @@ Identifier for the {anomaly-job}. It can be a job identifier, a group name, a comma-separated list of jobs or groups, or a wildcard expression. end::job-id-anomaly-detection-wildcard-list[] -tag::job-id-anomaly-detection-define[] -Identifier for the {anomaly-job}. This identifier can contain lowercase -alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start -and end with alphanumeric characters. -end::job-id-anomaly-detection-define[] +tag::job-id-data-frame-analytics[] +Identifier for the {dfanalytics-job}. +end::job-id-data-frame-analytics[] + +tag::job-id-data-frame-analytics-default[] +Identifier for the {dfanalytics-job}. If you do not specify this option, the API +returns information for the first hundred {dfanalytics-jobs}. +end::job-id-data-frame-analytics-default[] tag::job-id-data-frame-analytics-define[] Identifier for the {dfanalytics-job}. This identifier can contain lowercase @@ -717,6 +718,10 @@ alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start and end with alphanumeric characters. end::job-id-data-frame-analytics-define[] +tag::job-id-datafeed[] +The unique identifier for the job to which the {dfeed} sends data. +end::job-id-datafeed[] + tag::jobs-stats-anomaly-detection[] An array of {anomaly-job} statistics objects. For more information, see <>. @@ -745,6 +750,15 @@ the <> API. -- end::latency[] +tag::max-empty-searches[] +If a real-time {dfeed} has never seen any data (including during any initial +training period) then it will automatically stop itself and close its associated +job after this many real-time searches that return no documents. In other words, +it will stop after `frequency` times `max_empty_searches` of real-time operation. +If not set then a {dfeed} with no end time that sees no data will remain started +until it is explicitly stopped. By default this setting is not set. +end::max-empty-searches[] + tag::maximum-number-trees[] Defines the maximum number of trees the forest is allowed to contain. The maximum value is 2000. @@ -837,26 +851,24 @@ be seen in the model plot. Model plot config can be configured when the job is created or updated later. It must be disabled if performance issues are experienced. - -The `model_plot_config` object has the following properties: - -`enabled`::: -(boolean) If true, enables calculation and storage of the model bounds for -each entity that is being analyzed. By default, this is not enabled. - -`terms`::: -experimental[] (string) Limits data collection to this comma separated list of -partition or by field values. If terms are not specified or it is an empty -string, no filtering is applied. For example, "CPU,NetworkIn,DiskWrites". -Wildcards are not supported. Only the specified `terms` can be viewed when -using the Single Metric Viewer. -- end::model-plot-config[] +tag::model-plot-config-enabled[] +If true, enables calculation and storage of the model bounds for each entity +that is being analyzed. By default, this is not enabled. +end::model-plot-config-enabled[] + +tag::model-plot-config-terms[] +Limits data collection to this comma separated list of partition or by field +values. If terms are not specified or it is an empty string, no filtering is +applied. For example, "CPU,NetworkIn,DiskWrites". Wildcards are not supported. +Only the specified `terms` can be viewed when using the Single Metric Viewer. +end::model-plot-config-terms[] + tag::model-snapshot-id[] A numerical character string that uniquely identifies the model snapshot. For -example, `1491007364`. For more information about model snapshots, see -<>. +example, `1575402236000 `. end::model-snapshot-id[] tag::model-snapshot-retention-days[] @@ -925,6 +937,21 @@ Defines the name of the prediction field in the results. Defaults to `_prediction`. end::prediction-field-name[] +tag::query[] +The {es} query domain-specific language (DSL). This value corresponds to the +query object in an {es} search POST body. All the options that are supported by +{es} can be used, as this object is passed verbatim to {es}. By default, this +property has the following value: `{"match_all": {"boost": 1}}`. +end::query[] + +tag::query-delay[] +The number of seconds behind real time that data is queried. For example, if +data from 10:04 a.m. might not be searchable in {es} until 10:06 a.m., set this +property to 120 seconds. The default value is randomly selected between `60s` +and `120s`. This randomness improves the query performance when there are +multiple jobs running on the same node. +end::query-delay[] + tag::randomize-seed[] Defines the seed to the random generator that is used to pick which documents will be used for training. By default it is randomly generated. @@ -951,11 +978,33 @@ are deleted from {es}. The default value is null, which means results are retained. end::results-retention-days[] +tag::retain[] +If `true`, this snapshot will not be deleted during automatic cleanup of +snapshots older than `model_snapshot_retention_days`. However, this snapshot +will be deleted when the job is deleted. The default value is `false`. +end::retain[] + +tag::script-fields[] +Specifies scripts that evaluate custom expressions and returns script fields to +the {dfeed}. The detector configuration objects in a job can contain functions +that use these script fields. For more information, see +{ml-docs}/ml-configuring-transform.html[Transforming data with script fields] +and <>. +end::script-fields[] + +tag::scroll-size[] +The `size` parameter that is used in {es} searches. The default value is `1000`. +end::scroll-size[] + tag::size[] Specifies the maximum number of {dfanalytics-jobs} to obtain. The default value is `100`. end::size[] +tag::snapshot-id[] +Identifier for the model snapshot. +end::snapshot-id[] + tag::source-put-dfa[] The configuration of how to source the analysis data. It requires an `index`. Optionally, `query` and `_source` may be specified. @@ -1006,16 +1055,6 @@ function. -- end::summary-count-field-name[] -tag::timeout-start[] -Controls the amount of time to wait until the {dfanalytics-job} starts. Defaults -to 20 seconds. -end::timeout-start[] - -tag::timeout-stop[] -Controls the amount of time to wait until the {dfanalytics-job} stops. Defaults -to 20 seconds. -end::timeout-stop[] - tag::time-format[] The time format, which can be `epoch`, `epoch_ms`, or a custom pattern. The default value is `epoch`, which refers to UNIX or Epoch time (the number of @@ -1033,6 +1072,25 @@ timestamp, job creation fails. -- end::time-format[] +tag::time-span[] +The time span that each search will be querying. This setting is only applicable +when the mode is set to `manual`. For example: `3h`. +end::time-span[] + +tag::timeout-start[] +Controls the amount of time to wait until the {dfanalytics-job} starts. Defaults +to 20 seconds. +end::timeout-start[] + +tag::timeout-stop[] +Controls the amount of time to wait until the {dfanalytics-job} stops. Defaults +to 20 seconds. +end::timeout-stop[] + +tag::timestamp-results[] +The start time of the bucket for which these results were calculated. +end::timestamp-results[] + tag::tokenizer[] The name or definition of the <> to use after character filters are applied. This property is compulsory if