[role="xpack"]
[testenv="platinum"]
[[ml-datafeed-resource]]
=== {dfeed-cap} resources

A {dfeed} resource has the following properties:

`aggregations`::
(object) If set, the {dfeed} performs aggregation searches.
Support for aggregations is limited and should only be used with
low cardinality data. For more information, see
{stack-ov}/ml-configuring-aggregation.html[Aggregating Data for Faster Performance].

`chunking_config`::
(object) Specifies how data searches are split into time chunks.
See <<ml-datafeed-chunking-config>>.
For example: `{"mode": "manual", "time_span": "3h"}`

`datafeed_id`::
(string) A character string that uniquely identifies the {dfeed}.
This property is informational; you cannot change the identifier for existing
{dfeeds}.

`frequency`::
(time units) The interval at which scheduled queries are made while the
{dfeed} runs in real time. The default value is either the bucket span for short
bucket spans, or, for longer bucket spans, a sensible fraction of the bucket
span. For example: `150s`.

`indices`::
(array) An array of index names. For example: `["it_ops_metrics"]`

`job_id`::
(string) The unique identifier for the job to which the {dfeed} sends data.

`query`::
(object) The {es} query domain-specific language (DSL). This value
corresponds to the query object in an {es} search POST body. All the
options that are supported by {es} can be used, as this object is
passed verbatim to {es}. By default, this property has the following
value: `{"match_all": {"boost": 1}}`.

`query_delay`::
(time units) The number of seconds behind real time that data is queried. For
example, if data from 10:04 a.m. might not be searchable in {es} until
10:06 a.m., set this property to 120 seconds. The default value is randomly
selected between `60s` and `120s`. This randomness improves the query
performance when there are multiple jobs running on the same node.

`script_fields`::
(object) Specifies scripts that evaluate custom expressions and return
script fields to the {dfeed}.
The <<ml-detectorconfig,detector configuration objects>> in a job can contain
functions that use these script fields.
For more information, see
{stack-ov}/ml-configuring-transform.html[Transforming Data With Script Fields].

`scroll_size`::
(unsigned integer) The `size` parameter that is used in {es} searches.
The default value is `1000`.

`delayed_data_check_config`::
(object) Specifies whether the {dfeed} checks for missing data and
the size of the window. For example:
`{"enabled": true, "check_window": "1h"}`. See
<<ml-datafeed-delayed-data-check-config>>.

`max_empty_searches`::
(integer) If a real-time {dfeed} has never seen any data (including during
any initial training period), it automatically stops itself and
closes its associated job after this many real-time searches that return no
documents. In other words, it stops after `frequency` times
`max_empty_searches` of real-time operation. If not set,
a {dfeed} with no end time that sees no data remains started until
it is explicitly stopped. By default, this setting is not set.

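The relationship between `frequency` and `max_empty_searches` can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `parse_time_units` and `idle_seconds_before_auto_stop` are hypothetical helpers (not part of any {es} client), the parser handles only the `s`, `m`, `h`, and `d` suffixes, `frequency` is assumed to be set explicitly, and the identifiers and the `max_empty_searches` value in the sample are made up.

```python
import re

# Hypothetical helper: supports only the s, m, h, and d suffixes.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_time_units(value):
    """Convert a time-units string such as "150s" or "3h" to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def idle_seconds_before_auto_stop(datafeed):
    """Return frequency * max_empty_searches in seconds, or None if unset."""
    max_empty = datafeed.get("max_empty_searches")
    if max_empty is None:
        # Without the setting, the datafeed runs until explicitly stopped.
        return None
    return parse_time_units(datafeed["frequency"]) * max_empty


# Property values reuse the examples above; both IDs and the
# max_empty_searches value are invented for illustration.
example_datafeed = {
    "datafeed_id": "datafeed-it-ops",
    "job_id": "it-ops",
    "indices": ["it_ops_metrics"],
    "query": {"match_all": {"boost": 1}},
    "frequency": "150s",
    "scroll_size": 1000,
    "chunking_config": {"mode": "manual", "time_span": "3h"},
    "max_empty_searches": 10,
}

print(idle_seconds_before_auto_stop(example_datafeed))  # 1500
```
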
[[ml-datafeed-chunking-config]]
==== Chunking configuration objects

{dfeeds-cap} might be required to search over long time periods, such as several
months or years. This search is split into time chunks in order to ensure the load
on {es} is managed. Chunking configuration controls how the size of these time
chunks is calculated and is an advanced configuration option.

A chunking configuration object has the following properties:

`mode`::
There are three available modes: +
`auto`::: The chunk size will be dynamically calculated. This is the default
and recommended value.
`manual`::: Chunking will be applied according to the specified `time_span`.
`off`::: No chunking will be applied.

`time_span`::
(time units) The time span that each search will be querying.
This setting is only applicable when the mode is set to `manual`.
For example: `3h`.

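To make the idea of `manual` chunking concrete, the following Python sketch splits a search period into consecutive chunks of a fixed `time_span`. It illustrates the concept only and is not how {es} computes or aligns its chunks; `manual_chunks` is a hypothetical function and the dates are arbitrary.

```python
from datetime import datetime, timedelta


def manual_chunks(start, end, time_span):
    """Yield (chunk_start, chunk_end) pairs that cover [start, end)."""
    if time_span <= timedelta(0):
        raise ValueError("time_span must be positive")
    chunk_start = start
    while chunk_start < end:
        # The final chunk is shortened so it never runs past the end time.
        chunk_end = min(chunk_start + time_span, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end


# An eight-hour search period split with a time_span of 3h.
chunks = list(manual_chunks(
    datetime(2019, 1, 1, 0, 0),
    datetime(2019, 1, 1, 8, 0),
    timedelta(hours=3),
))
for chunk_start, chunk_end in chunks:
    print(chunk_start.isoformat(), "->", chunk_end.isoformat())
```
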
[[ml-datafeed-delayed-data-check-config]]
==== Delayed data check configuration objects

The {dfeed} can optionally search over indices that have already been read in
an effort to determine whether any data has subsequently been added to the index.
If missing data is found, it is a good indication that the `query_delay` option
is set too low and the data is being indexed after the {dfeed} has passed that
moment in time. See
{stack-ov}/ml-delayed-data-detection.html[Working with delayed data].

This check runs only on real-time {dfeeds}.

The configuration object has the following properties:

`enabled`::
(boolean) Specifies whether the {dfeed} periodically checks for delayed data.
Defaults to `true`.

`check_window`::
(time units) The window of time that is searched for late data. This window of
time ends with the latest finalized bucket. It defaults to `null`, which
causes an appropriate `check_window` to be calculated when the real-time
{dfeed} runs. In particular, the default `check_window` span calculation is
based on the maximum of `2h` and `8 * bucket_span`.

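The default `check_window` rule can be written as a one-line calculation. The Python sketch below applies only the stated rule, the maximum of `2h` and `8 * bucket_span`; because the description says the calculation is "based on" that maximum, the value {es} actually uses may be subject to further limits. `default_check_window` is a hypothetical name.

```python
from datetime import timedelta


def default_check_window(bucket_span):
    """Return the larger of 2h and eight bucket spans."""
    return max(timedelta(hours=2), 8 * bucket_span)


# A 5m bucket span gives 40m, so the 2h floor applies.
print(default_check_window(timedelta(minutes=5)))   # 2:00:00
# A 30m bucket span gives 4h, which exceeds the floor.
print(default_check_window(timedelta(minutes=30)))  # 4:00:00
```
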
[float]
[[ml-datafeed-counts]]
==== {dfeed-cap} counts

The get {dfeed} statistics API provides information about the operational
progress of a {dfeed}. All of these properties are informational; you cannot
update their values:

`assignment_explanation`::
(string) For started {dfeeds} only, contains messages relating to the
selection of a node.

`datafeed_id`::
(string) A character string that uniquely identifies the {dfeed}.

`node`::
(object) The node upon which the {dfeed} is started. The {dfeed} and job will
be on the same node.
`id`::: The unique identifier of the node. For example,
`0-o0tOoRTwKFZifatTWKNw`.
`name`::: The node name. For example, `0-o0tOo`.
`ephemeral_id`::: The node ephemeral ID.
`transport_address`::: The host and port where transport connections are
accepted. For example, `127.0.0.1:9300`.
`attributes`::: For example, `{"ml.machine_memory": "17179869184"}`.

`state`::
(string) The status of the {dfeed}, which can be one of the following values: +
`started`::: The {dfeed} is actively receiving data.
`stopped`::: The {dfeed} is stopped and will not receive data until it is
restarted.

`timing_stats`::
(object) An object that provides statistical information about the timing aspects of this {dfeed}. +
`job_id`::: A character string that uniquely identifies the job.
`search_count`::: The number of searches performed by this {dfeed}.
`total_search_time_ms`::: The total time, in milliseconds, that the {dfeed} spent searching.

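The two `timing_stats` counters can be combined into an average search time. This derived figure is not returned by the API; the Python sketch below is one way a client might compute it. `average_search_time_ms` is a hypothetical helper and the sample values are invented.

```python
def average_search_time_ms(timing_stats):
    """Return the mean time per search in milliseconds, or None if no searches ran."""
    search_count = timing_stats["search_count"]
    if search_count == 0:
        return None
    return timing_stats["total_search_time_ms"] / search_count


# Invented sample shaped like the timing_stats object described above.
sample_timing_stats = {
    "job_id": "it-ops",
    "search_count": 40,
    "total_search_time_ms": 500.0,
}

print(average_search_time_ms(sample_timing_stats))  # 12.5
```
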