[7.x][DOCS] Move datafeed resource definitions into APIs (#50516)

Lisa Cawley 2019-12-30 09:35:16 -08:00 committed by GitHub
parent 8869f2b9b2
commit 4b829db593
11 changed files with 168 additions and 312 deletions

@ -1,161 +0,0 @@
[role="xpack"]
[testenv="platinum"]
[[ml-datafeed-resource]]
=== {dfeed-cap} resources
A {dfeed} resource has the following properties:
`aggregations`::
(object) If set, the {dfeed} performs aggregation searches.
Support for aggregations is limited and should only be used with
low cardinality data. For more information, see
{ml-docs}/ml-configuring-aggregation.html[Aggregating data for faster performance].
`chunking_config`::
(object) Specifies how data searches are split into time chunks.
See <<ml-datafeed-chunking-config>>.
For example: `{"mode": "manual", "time_span": "3h"}`
`datafeed_id`::
(string) A numerical character string that uniquely identifies the {dfeed}.
This property is informational; you cannot change the identifier for existing
{dfeeds}.
`frequency`::
(time units) The interval at which scheduled queries are made while the
{dfeed} runs in real time. The default value is either the bucket span for short
bucket spans, or, for longer bucket spans, a sensible fraction of the bucket
span. For example: `150s`.
`indices`::
(array) An array of index names. For example: `["it_ops_metrics"]`
`job_id`::
(string) The unique identifier for the job to which the {dfeed} sends data.
`query`::
(object) The {es} query domain-specific language (DSL). This value
corresponds to the query object in an {es} search POST body. All the
options that are supported by {es} can be used, as this object is
passed verbatim to {es}. By default, this property has the following
value: `{"match_all": {"boost": 1}}`.
`query_delay`::
(time units) The number of seconds behind real time that data is queried. For
example, if data from 10:04 a.m. might not be searchable in {es} until
10:06 a.m., set this property to 120 seconds. The default value is randomly
selected between `60s` and `120s`. This randomness improves the query
performance when there are multiple jobs running on the same node.
`script_fields`::
(object) Specifies scripts that evaluate custom expressions and return
script fields to the {dfeed}.
The detector configuration objects in a job can contain
functions that use these script fields.
For more information, see
{ml-docs}/ml-configuring-transform.html[Transforming data with script fields].
`scroll_size`::
(unsigned integer) The `size` parameter that is used in {es} searches.
The default value is `1000`.
`delayed_data_check_config`::
(object) Specifies whether the data feed checks for missing data and
the size of the window. For example:
`{"enabled": true, "check_window": "1h"}` See
<<ml-datafeed-delayed-data-check-config>>.
`max_empty_searches`::
(integer) If a real-time {dfeed} has never seen any data (including during
any initial training period) then it will automatically stop itself and
close its associated job after this many real-time searches that return no
documents. In other words, it will stop after `frequency` times
`max_empty_searches` of real-time operation. If not set
then a {dfeed} with no end time that sees no data will remain started until
it is explicitly stopped. By default this setting is not set.
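The `frequency` times `max_empty_searches` relationship described above can be sketched as follows. This is a minimal illustration, not {es} code: the function name `idle_time_before_auto_stop` is hypothetical, and the value `10` for `max_empty_searches` is an arbitrary example.

```python
from datetime import timedelta


def idle_time_before_auto_stop(frequency: timedelta, max_empty_searches: int) -> timedelta:
    """Approximate real-time operation before a datafeed that has never seen data stops.

    Hypothetical helper: it only multiplies the two settings, as the
    property description states.
    """
    if max_empty_searches < 1:
        raise ValueError("max_empty_searches must be a positive integer")
    return frequency * max_empty_searches


# A 150s frequency with 10 allowed empty searches gives 1500 seconds.
print(idle_time_before_auto_stop(timedelta(seconds=150), 10))  # 0:25:00
```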
[[ml-datafeed-chunking-config]]
==== Chunking configuration objects
{dfeeds-cap} might be required to search over long time periods, for several months
or years. This search is split into time chunks in order to ensure the load
on {es} is managed. Chunking configuration controls how the size of these time
chunks is calculated and is an advanced configuration option.
A chunking configuration object has the following properties:
`mode`::
There are three available modes: +
`auto`::: The chunk size will be dynamically calculated. This is the default
and recommended value.
`manual`::: Chunking will be applied according to the specified `time_span`.
`off`::: No chunking will be applied.
`time_span`::
(time units) The time span that each search will be querying.
This setting is only applicable when the mode is set to `manual`.
For example: `3h`.
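To illustrate what `manual` mode means, the sketch below splits a search range into consecutive chunks of at most `time_span`. It is an assumption-laden illustration only: `manual_chunks` is a hypothetical name, and the real chunking logic inside {es} may align or size chunks differently.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def manual_chunks(start: datetime, end: datetime,
                  time_span: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [chunk_start, chunk_end) ranges covering start..end."""
    if time_span <= timedelta(0):
        raise ValueError("time_span must be positive")
    chunk_start = start
    while chunk_start < end:
        # The final chunk is shortened so that it never passes the end time.
        chunk_end = min(chunk_start + time_span, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end


# An 8-hour range with a 3h time_span yields chunks of 3h, 3h, and 2h.
chunks = list(manual_chunks(datetime(2019, 12, 1, 0), datetime(2019, 12, 1, 8),
                            timedelta(hours=3)))
for chunk_start, chunk_end in chunks:
    print(chunk_start, "->", chunk_end)
```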
[[ml-datafeed-delayed-data-check-config]]
==== Delayed data check configuration objects
The {dfeed} can optionally search over indices that have already been read in
an effort to determine whether any data has subsequently been added to the index.
If missing data is found, it is a good indication that the `query_delay` option
is set too low and the data is being indexed after the {dfeed} has passed that
moment in time. See
{ml-docs}/ml-delayed-data-detection.html[Working with delayed data].
This check runs only on real-time {dfeeds}.
The configuration object has the following properties:
`enabled`::
(boolean) Specifies whether the {dfeed} periodically checks for delayed data.
Defaults to `true`.
`check_window`::
(time units) The window of time that is searched for late data. This window of
time ends with the latest finalized bucket. It defaults to `null`, which
causes an appropriate `check_window` to be calculated when the real-time
{dfeed} runs. In particular, the default `check_window` span calculation is
based on the maximum of `2h` or `8 * bucket_span`.
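The default `check_window` rule stated above (the maximum of `2h` or `8 * bucket_span`) can be written out as a short sketch. The function name `default_check_window` is hypothetical, and the sketch covers only the rule as described here, not any further adjustments {es} might apply.

```python
from datetime import timedelta


def default_check_window(bucket_span: timedelta) -> timedelta:
    """Return the larger of 2 hours and 8 bucket spans."""
    return max(timedelta(hours=2), 8 * bucket_span)


# Short bucket spans fall back to the 2h floor; long ones scale with the span.
print(default_check_window(timedelta(minutes=5)))  # 2:00:00
print(default_check_window(timedelta(hours=1)))    # 8:00:00
```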
[float]
[[ml-datafeed-counts]]
==== {dfeed-cap} counts
The get {dfeed} statistics API provides information about the operational
progress of a {dfeed}. All of these properties are informational; you cannot
update their values:
`assignment_explanation`::
(string) For started {dfeeds} only, contains messages relating to the
selection of a node.
`datafeed_id`::
(string) A numerical character string that uniquely identifies the {dfeed}.
`node`::
(object) The node upon which the {dfeed} is started. The {dfeed} and job will
be on the same node.
`id`::: The unique identifier of the node. For example,
"0-o0tOoRTwKFZifatTWKNw".
`name`::: The node name. For example, `0-o0tOo`.
`ephemeral_id`::: The node ephemeral ID.
`transport_address`::: The host and port where transport HTTP connections are
accepted. For example, `127.0.0.1:9300`.
`attributes`::: For example, `{"ml.machine_memory": "17179869184"}`.
`state`::
(string) The status of the {dfeed}, which can be one of the following values: +
`started`::: The {dfeed} is actively receiving data.
`stopped`::: The {dfeed} is stopped and will not receive data until it is
re-started.
`timing_stats`::
(object) An object that provides statistical information about the timing aspects of this {dfeed}. +
`job_id`::: A numerical character string that uniquely identifies the job.
`search_count`::: Number of searches performed by this datafeed.
`total_search_time_ms`::: Total time the datafeed spent searching in milliseconds.

@ -28,7 +28,8 @@ can delete it.
==== {api-path-parms-title}
`<feed_id>`::
(Required, string) Identifier for the {dfeed}.
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=datafeed-id]
[[ml-delete-datafeed-query-parms]]
==== {api-query-parms-title}

@ -45,36 +45,66 @@ IMPORTANT: This API returns a maximum of 10,000 {dfeeds}.
==== {api-path-parms-title}
`<feed_id>`::
(Optional, string) Identifier for the {dfeed}. It can be a {dfeed} identifier
or a wildcard expression. If you do not specify one of these options, the API
returns statistics for all {dfeeds}.
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=datafeed-id-wildcard]
+
--
If you do not specify one of these options, the API returns information about
all {dfeeds}.
--
[[ml-get-datafeed-stats-query-parms]]
==== {api-query-parms-title}
`allow_no_datafeeds`::
(Optional, boolean) Specifies what to do when the request:
+
--
* Contains wildcard expressions and there are no {dfeeds} that match.
* Contains the `_all` string or no identifiers and there are no matches.
* Contains wildcard expressions and there are only partial matches.
The default value is `true`, which returns an empty `datafeeds` array when
there are no matches and the subset of results when there are partial matches.
If this parameter is `false`, the request returns a `404` status code when there
are no matches or only partial matches.
--
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=allow-no-datafeeds]
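The `allow_no_datafeeds` behavior can be modeled with a small sketch. This is not the server implementation: `resolve_datafeeds` is a hypothetical function, wildcard matching is approximated with Python's `fnmatch`, and a "partial match" is assumed to mean that at least one expression in the request matches nothing.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence, Tuple


def resolve_datafeeds(expressions: Sequence[str], existing: Sequence[str],
                      allow_no_datafeeds: bool = True) -> Tuple[int, List[str]]:
    """Return a (status code, matched datafeed IDs) pair for a request."""
    if not expressions or list(expressions) == ["_all"]:
        # No identifiers or `_all`: every existing datafeed is requested.
        matched = sorted(existing)
        all_resolved = bool(matched)
    else:
        matched_set = set()
        all_resolved = True
        for expression in expressions:
            hits = [name for name in existing if fnmatchcase(name, expression)]
            if hits:
                matched_set.update(hits)
            else:
                all_resolved = False  # this expression matched nothing
        matched = sorted(matched_set)
    if not allow_no_datafeeds and not all_resolved:
        return 404, []
    # Default: an empty array for no matches, the subset for partial matches.
    return 200, matched


existing = ["datafeed-a", "datafeed-b"]
print(resolve_datafeeds(["missing*"], existing))
print(resolve_datafeeds(["missing*"], existing, allow_no_datafeeds=False))
```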
[[ml-get-datafeed-stats-results]]
==== {api-response-body-title}
The API returns the following information:
The API returns an array of {dfeed} count objects. All of these properties are
informational; you cannot update their values.
`datafeeds`::
(array) An array of {dfeed} count objects.
For more information, see <<ml-datafeed-counts>>.
`assignment_explanation`::
(string) For started {dfeeds} only, contains messages relating to the selection
of a node.
`datafeed_id`::
(string)
include::{docdir}/ml/ml-shared.asciidoc[tag=datafeed-id]
`node`::
(object) The node upon which the {dfeed} is started. The {dfeed} and job will be
on the same node.
`node`.`id`::: The unique identifier of the node. For example,
`0-o0tOoRTwKFZifatTWKNw`.
`node`.`name`::: The node name. For example, `0-o0tOo`.
`node`.`ephemeral_id`::: The node ephemeral ID.
`node`.`transport_address`::: The host and port where transport HTTP connections
are accepted. For example, `127.0.0.1:9300`.
`node`.`attributes`::: For example, `{"ml.machine_memory": "17179869184"}`.
`state`::
(string) The status of the {dfeed}, which can be one of the following values:
+
--
* `started`::: The {dfeed} is actively receiving data.
* `stopped`::: The {dfeed} is stopped and will not receive data until it is
re-started.
--
`timing_stats`::
(object) An object that provides statistical information about timing aspects of
this {dfeed}.
//average_search_time_per_bucket_ms
//bucket_count
//exponential_average_search_time_per_hour_ms
`timing_stats`.`job_id`:::
include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-anomaly-detection]
`timing_stats`.`search_count`::: Number of searches performed by this {dfeed}.
`timing_stats`.`total_search_time_ms`::: Total time the {dfeed} spent searching
in milliseconds.
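The sample response for `datafeed-high_sum_total_sales` in this API's example also reports `bucket_count` and `average_search_time_per_bucket_ms`, which are not yet described in the property list. The sketch below assumes that the average is simply `total_search_time_ms` divided by `bucket_count`. That relationship is inferred from the sample values (296.0 / 619), not from a documented definition, and the helper name is hypothetical.

```python
def average_search_time_per_bucket_ms(timing_stats: dict) -> float:
    """Derive the per-bucket average from a timing_stats object.

    Assumption: average = total_search_time_ms / bucket_count.
    Returning 0.0 for an empty bucket count is also an assumption.
    """
    bucket_count = timing_stats["bucket_count"]
    if bucket_count == 0:
        return 0.0
    return timing_stats["total_search_time_ms"] / bucket_count


# Values copied from the sample response.
stats = {
    "job_id": "high_sum_total_sales",
    "search_count": 27,
    "bucket_count": 619,
    "total_search_time_ms": 296.0,
}
print(average_search_time_per_bucket_ms(stats))
```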
[[ml-get-datafeed-stats-response-codes]]
==== {api-response-codes-title}
@ -86,14 +116,11 @@ The API returns the following information:
[[ml-get-datafeed-stats-example]]
==== {api-examples-title}
The following example gets usage information for the
`datafeed-total-requests` {dfeed}:
[source,console]
--------------------------------------------------
GET _ml/datafeeds/datafeed-total-requests/_stats
GET _ml/datafeeds/datafeed-high_sum_total_sales/_stats
--------------------------------------------------
// TEST[skip:setup:server_metrics_startdf]
// TEST[skip:Kibana sample data started datafeed]
The API returns the following results:
@ -103,7 +130,7 @@ The API returns the following results:
"count": 1,
"datafeeds": [
{
"datafeed_id": "datafeed-total-requests",
"datafeed_id": "datafeed-high_sum_total_sales",
"state": "started",
"node": {
"id": "2spCyo1pRi2Ajo-j-_dnPX",
@ -117,9 +144,12 @@ The API returns the following results:
},
"assignment_explanation": "",
"timing_stats": {
"job_id": "job-total-requests",
"search_count": 20,
"total_search_time_ms": 120.5
"job_id" : "high_sum_total_sales",
"search_count" : 27,
"bucket_count" : 619,
"total_search_time_ms" : 296.0,
"average_search_time_per_bucket_ms" : 0.4781906300484653,
"exponential_average_search_time_per_hour_ms" : 33.28246548059884
}
}
]

@ -42,35 +42,26 @@ IMPORTANT: This API returns a maximum of 10,000 {dfeeds}.
==== {api-path-parms-title}
`<feed_id>`::
(Optional, string) Identifier for the {dfeed}. It can be a {dfeed} identifier
or a wildcard expression. If you do not specify one of these options, the API
returns information about all {dfeeds}.
(Optional, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=datafeed-id-wildcard]
+
--
If you do not specify one of these options, the API returns information about
all {dfeeds}.
--
[[ml-get-datafeed-query-parms]]
==== {api-query-parms-title}
`allow_no_datafeeds`::
(Optional, boolean) Specifies what to do when the request:
+
--
* Contains wildcard expressions and there are no {dfeeds} that match.
* Contains the `_all` string or no identifiers and there are no matches.
* Contains wildcard expressions and there are only partial matches.
The default value is `true`, which returns an empty `datafeeds` array when
there are no matches and the subset of results when there are partial matches.
If this parameter is `false`, the request returns a `404` status code when there
are no matches or only partial matches.
--
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=allow-no-datafeeds]
[[ml-get-datafeed-results]]
==== {api-response-body-title}
The API returns the following information:
`datafeeds`::
(array) An array of {dfeed} objects.
For more information, see <<ml-datafeed-resource>>.
The API returns an array of {dfeed} resources. For the full list of properties,
see <<ml-put-datafeed-request-body,create {dfeeds} API>>.
[[ml-get-datafeed-response-codes]]
==== {api-response-codes-title}
@ -82,14 +73,11 @@ The API returns the following information:
[[ml-get-datafeed-example]]
==== {api-examples-title}
The following example gets configuration information for the
`datafeed-total-requests` {dfeed}:
[source,console]
--------------------------------------------------
GET _ml/datafeeds/datafeed-total-requests
GET _ml/datafeeds/datafeed-high_sum_total_sales
--------------------------------------------------
// TEST[skip:setup:server_metrics_datafeed]
// TEST[skip:Kibana sample data]
The API returns the following results:
@ -99,23 +87,31 @@ The API returns the following results:
"count": 1,
"datafeeds": [
{
"datafeed_id": "datafeed-total-requests",
"job_id": "total-requests",
"query_delay": "83474ms",
"datafeed_id": "datafeed-high_sum_total_sales",
"job_id": "high_sum_total_sales",
"query_delay": "93169ms",
"indices": [
"server-metrics"
"kibana_sample_data_ecommerce"
],
"query": {
"match_all": {
"boost": 1.0
"query" : {
"bool" : {
"filter" : [
{
"term" : {
"_index" : "kibana_sample_data_ecommerce"
}
}
]
}
},
"scroll_size": 1000,
"chunking_config": {
"mode": "auto"
},
"delayed_data_check_config" : {
"enabled" : true
}
}
]
}
----
// TESTRESPONSE[s/"query.boost": "1.0"/"query.boost": $body.query.boost/]

@ -41,18 +41,17 @@ it to ensure it is returning the expected data.
==== {api-path-parms-title}
`<datafeed_id>`::
(Required, string) Identifier for the {dfeed}.
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=datafeed-id]
[[ml-preview-datafeed-example]]
==== {api-examples-title}
The following example obtains a preview of the `datafeed-farequote` {dfeed}:
[source,console]
--------------------------------------------------
GET _ml/datafeeds/datafeed-farequote/_preview
GET _ml/datafeeds/datafeed-high_sum_total_sales/_preview
--------------------------------------------------
// TEST[skip:setup:farequote_datafeed]
// TEST[skip:Kibana sample data]
The data that is returned for this example is as follows:
@ -60,22 +59,29 @@ The data that is returned for this example is as follows:
----
[
{
"time": 1454803200000,
"airline": "JZA",
"doc_count": 5,
"responsetime": 990.4628295898438
"order_date" : 1575504259000,
"category.keyword" : "Men's Clothing",
"customer_full_name.keyword" : "Sultan Al Benson",
"taxful_total_price" : 35.96875
},
{
"time": 1454803200000,
"airline": "JBU",
"doc_count": 23,
"responsetime": 877.5927124023438
"order_date" : 1575504518000,
"category.keyword" : [
"Women's Accessories",
"Women's Clothing"
],
"customer_full_name.keyword" : "Pia Webb",
"taxful_total_price" : 83.0
},
{
"time": 1454803200000,
"airline": "KLM",
"doc_count": 42,
"responsetime": 1355.481201171875
}
"order_date" : 1575505382000,
"category.keyword" : [
"Women's Accessories",
"Women's Shoes"
],
"customer_full_name.keyword" : "Brigitte Graham",
"taxful_total_price" : 72.0
},
...
]
----

@ -43,70 +43,55 @@ those same roles.
==== {api-path-parms-title}
`<feed_id>`::
(Required, string) A numerical character string that uniquely identifies the
{dfeed}. This identifier can contain lowercase alphanumeric characters (a-z
and 0-9), hyphens, and underscores. It must start and end with alphanumeric
characters.
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=datafeed-id]
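The identifier rules quoted in this hunk (lowercase alphanumeric characters, hyphens, and underscores, starting and ending with an alphanumeric character) can be checked client-side with a sketch like the following. The function name `is_valid_datafeed_id` is hypothetical, and the check covers only the rules stated here; the server may enforce further constraints.

```python
import re

# First and last characters must be a-z or 0-9; hyphens and underscores
# are allowed only in between.
_DATAFEED_ID = re.compile(r"[a-z0-9](?:[a-z0-9_-]*[a-z0-9])?")


def is_valid_datafeed_id(datafeed_id: str) -> bool:
    """Return True if the whole string satisfies the stated ID rules."""
    return _DATAFEED_ID.fullmatch(datafeed_id) is not None


print(is_valid_datafeed_id("datafeed-high_sum_total_sales"))  # True
print(is_valid_datafeed_id("-starts-with-hyphen"))            # False
print(is_valid_datafeed_id("Has-Uppercase"))                  # False
```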
[[ml-put-datafeed-request-body]]
==== {api-request-body-title}
`aggregations`::
(Optional, object) If set, the {dfeed} performs aggregation searches. For more
information, see <<ml-datafeed-resource>>.
(Optional, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=aggregations]
`chunking_config`::
(Optional, object) Specifies how data searches are split into time chunks. See
<<ml-datafeed-chunking-config>>.
(Optional, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=chunking-config]
`delayed_data_check_config`::
(Optional, object) Specifies whether the data feed checks for missing data and
the size of the window. See <<ml-datafeed-delayed-data-check-config>>.
(Optional, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=delayed-data-check-config]
`frequency`::
(Optional, <<time-units, time units>>) The interval at which scheduled queries
are made while the {dfeed} runs in real time. The default value is either the
bucket span for short bucket spans, or, for longer bucket spans, a sensible
fraction of the bucket span. For example: `150s`.
(Optional, <<time-units, time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=frequency]
`indices`::
(Required, array) An array of index names. Wildcards are supported. For
example: `["it_ops_metrics", "server*"]`.
+
--
NOTE: If any indices are in remote clusters then `cluster.remote.connect` must
not be set to `false` on any ML node.
--
(Required, array)
include::{docdir}/ml/ml-shared.asciidoc[tag=indices]
`job_id`::
(Required, string) A numerical character string that uniquely identifies the
{anomaly-job}.
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=job-id-anomaly-detection]
`max_empty_searches`::
(Optional, integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=max-empty-searches]
`query`::
(Optional, object) The {es} query domain-specific language (DSL). This value
corresponds to the query object in an {es} search POST body. All the options
that are supported by {es} can be used, as this object is passed verbatim to
{es}. By default, this property has the following value:
`{"match_all": {"boost": 1}}`.
(Optional, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=query]
`query_delay`::
(Optional, <<time-units, time units>>) The number of seconds behind real time
that data is queried. For example, if data from 10:04 a.m. might not be
searchable in {es} until 10:06 a.m., set this property to 120 seconds. The
default value is `60s`.
(Optional, <<time-units, time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=query-delay]
`script_fields`::
(Optional, object) Specifies scripts that evaluate custom expressions and
return script fields to the {dfeed}. The detector configuration objects in a
job can contain functions that use these script fields. For more information,
see <<request-body-search-script-fields,Script fields>>.
(Optional, object)
include::{docdir}/ml/ml-shared.asciidoc[tag=script-fields]
`scroll_size`::
(Optional, unsigned integer) The `size` parameter that is used in {es}
searches. The default value is `1000`.
For more information about these properties,
see <<ml-datafeed-resource>>.
(Optional, unsigned integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=scroll-size]
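As a rough illustration of how the required and optional properties above fit together, the sketch below assembles a request body for the create {dfeed} API. `build_datafeed_body` is a hypothetical helper, not part of any client library. It only checks that the two required properties are present and that the optional ones are among those listed in this section.

```python
import json
from typing import Any, Dict, List

# Optional request body properties listed in this section.
OPTIONAL_PROPERTIES = {
    "aggregations", "chunking_config", "delayed_data_check_config",
    "frequency", "max_empty_searches", "query", "query_delay",
    "script_fields", "scroll_size",
}


def build_datafeed_body(job_id: str, indices: List[str], **optional: Any) -> Dict[str, Any]:
    """Assemble a request body from the required and optional properties."""
    if not job_id or not indices:
        raise ValueError("job_id and indices are required")
    unknown = set(optional) - OPTIONAL_PROPERTIES
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    body: Dict[str, Any] = {"job_id": job_id, "indices": list(indices)}
    body.update(optional)
    return body


body = build_datafeed_body(
    "high_sum_total_sales",
    ["kibana_sample_data_ecommerce"],
    query_delay="120s",
    chunking_config={"mode": "manual", "time_span": "3h"},
)
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent as the body of a `PUT _ml/datafeeds/<feed_id>` request.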
[[ml-put-datafeed-example]]
==== {api-examples-title}

@ -74,7 +74,8 @@ creation/update and runs the query using those same roles.
==== {api-path-parms-title}
`<feed_id>`::
(Required, string) Identifier for the {dfeed}.
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=datafeed-id]
[[ml-start-datafeed-request-body]]
==== {api-request-body-title}
@ -94,7 +95,7 @@ creation/update and runs the query using those same roles.
[[ml-start-datafeed-example]]
==== {api-examples-title}
The following example starts the `datafeed-it-ops-kpi` {dfeed}:
The following example starts the `datafeed-total-requests` {dfeed}:
[source,console]
--------------------------------------------------

@ -40,25 +40,15 @@ comma-separated list of {dfeeds} or a wildcard expression. You can close all
==== {api-path-parms-title}
`<feed_id>`::
(Required, string) Identifier for the {dfeed}. It can be a {dfeed} identifier
or a wildcard expression.
(Required, string)
include::{docdir}/ml/ml-shared.asciidoc[tag=datafeed-id-wildcard]
[[ml-stop-datafeed-query-parms]]
==== {api-query-parms-title}
`allow_no_datafeeds`::
(Optional, boolean) Specifies what to do when the request:
+
--
* Contains wildcard expressions and there are no {dfeeds} that match.
* Contains the `_all` string or no identifiers and there are no matches.
* Contains wildcard expressions and there are only partial matches.
The default value is `true`, which returns an empty `datafeeds` array when
there are no matches and the subset of results when there are partial matches.
If this parameter is `false`, the request returns a `404` status code when there
are no matches or only partial matches.
--
(Optional, boolean)
include::{docdir}/ml/ml-shared.asciidoc[tag=allow-no-datafeeds]
[[ml-stop-datafeed-request-body]]
==== {api-request-body-title}

@ -5,14 +5,15 @@
Delayed data are documents that are indexed late. That is to say, it is data
related to a time that the {dfeed} has already processed.
When you create a datafeed, you can specify a
{ref}/ml-datafeed-resource.html[`query_delay`] setting. This setting enables the
datafeed to wait for some time past real-time, which means any "late" data in
this period is fully indexed before the datafeed tries to gather it. However, if
the setting is set too low, the datafeed may query for data before it has been
indexed and consequently miss that document. Conversely, if it is set too high,
analysis drifts farther away from real-time. The balance that is struck depends
upon each use case and the environmental factors of the cluster.
When you create a {dfeed}, you can specify a
{ref}/ml-put-datafeed.html#ml-put-datafeed-request-body[`query_delay`] setting.
This setting enables the {dfeed} to wait for some time past real-time, which
means any "late" data in this period is fully indexed before the {dfeed} tries
to gather it. However, if the setting is set too low, the {dfeed} may query for
data before it has been indexed and consequently miss that document. Conversely,
if it is set too high, analysis drifts farther away from real-time. The balance
that is struck depends upon each use case and the environmental factors of the
cluster.
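The effect of `query_delay` can be sketched as a simple subtraction: the {dfeed} only queries data up to the current time minus the delay. The function name `latest_queryable_time` is hypothetical, and the sketch ignores bucket alignment and any other scheduling details.

```python
from datetime import datetime, timedelta


def latest_queryable_time(now: datetime, query_delay: timedelta) -> datetime:
    """Return the most recent timestamp a real-time search would cover."""
    return now - query_delay


# With a 120s delay, a search running at 10:06 only reaches data up to 10:04,
# which gives documents from 10:04 two minutes to become searchable.
now = datetime(2019, 12, 30, 10, 6)
print(latest_queryable_time(now, timedelta(seconds=120)))  # 2019-12-30 10:04:00
```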
==== Why worry about delayed data?
@ -28,8 +29,7 @@ recorded so that you can determine a next course of action.
==== How do we detect delayed data?
In addition to the `query_delay` field, there is a
{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config],
In addition to the `query_delay` field, there is a delayed data check config,
which enables you to configure the datafeed to look in the past for delayed data.
Every 15 minutes or every `check_window`, whichever is smaller, the datafeed
triggers a document search over the configured indices. This search looks over a

@ -465,3 +465,14 @@ This page was deleted.
See the details in
[[ml-apimodelplotconfig]]
<<ml-put-job>>, <<ml-update-job>>, and <<ml-get-job>>.
[role="exclude",id="ml-datafeed-resource"]
=== {dfeed-cap} resources
This page was deleted.
[[ml-datafeed-chunking-config]]
See the details in <<ml-put-datafeed>>, <<ml-update-datafeed>>,
[[ml-datafeed-delayed-data-check-config]]
<<ml-get-datafeed>>,
[[ml-datafeed-counts]]
<<ml-get-datafeed-stats>>.

@ -5,15 +5,12 @@
These resource definitions are used in APIs related to {ml-features} and
{security-features} and in {kib} advanced {ml} job configuration options.
* <<ml-datafeed-resource,{dfeeds-cap}>>
* <<ml-datafeed-counts,{dfeed-cap} counts>>
* <<ml-dfa-analysis-objects>>
* <<ml-jobstats,{anomaly-jobs-cap} statistics>>
* <<ml-snapshot-resource,{anomaly-detect-cap} model snapshots>>
* <<ml-results-resource,{anomaly-detect-cap} results>>
* <<role-mapping-resources,Role mappings>>
include::{es-repo-dir}/ml/anomaly-detection/apis/datafeedresource.asciidoc[]
include::{es-repo-dir}/ml/df-analytics/apis/analysisobjects.asciidoc[]
include::{es-repo-dir}/ml/anomaly-detection/apis/jobcounts.asciidoc[]
include::{es-repo-dir}/ml/anomaly-detection/apis/snapshotresource.asciidoc[]