OpenSearch/docs/reference/ml/anomaly-detection/apis/datafeedresource.asciidoc

[role="xpack"]
[testenv="platinum"]
[[ml-datafeed-resource]]
=== {dfeed-cap} resources

A {dfeed} resource has the following properties:

`aggregations`::
  (object) If set, the {dfeed} performs aggregation searches.
  Support for aggregations is limited and should only be used with
  low cardinality data. For more information, see
  {stack-ov}/ml-configuring-aggregation.html[Aggregating Data for Faster Performance].

`chunking_config`::
  (object) Specifies how data searches are split into time chunks.
  See <<ml-datafeed-chunking-config>>.
  For example: `{"mode": "manual", "time_span": "3h"}`

`datafeed_id`::
 (string) A numerical character string that uniquely identifies the {dfeed}.
 This property is informational; you cannot change the identifier for existing
 {dfeeds}.

`frequency`::
  (time units) The interval at which scheduled queries are made while the
  {dfeed} runs in real time. The default value is either the bucket span for short
  bucket spans, or, for longer bucket spans, a sensible fraction of the bucket
  span. For example: `150s`.

`indices`::
  (array) An array of index names. For example: `["it_ops_metrics"]`

`job_id`::
 (string) The unique identifier for the job to which the {dfeed} sends data.

`query`::
  (object) The {es} query domain-specific language (DSL). This value
  corresponds to the query object in an {es} search POST body. All the
  options that are supported by {es} can be used, as this object is
  passed verbatim to {es}. By default, this property has the following
  value: `{"match_all": {"boost": 1}}`.

`query_delay`::
  (time units) The number of seconds behind real time that data is queried. For
  example, if data from 10:04 a.m. might not be searchable in {es} until
  10:06 a.m., set this property to 120 seconds. The default value is randomly
  selected between `60s` and `120s`. This randomness improves the query
  performance when there are multiple jobs running on the same node.

`script_fields`::
  (object) Specifies scripts that evaluate custom expressions and returns
  script fields to the {dfeed}.
  The <<ml-detectorconfig,detector configuration objects>> in a job can contain
  functions that use these script fields.
  For more information, see
  {stack-ov}/ml-configuring-transform.html[Transforming Data With Script Fields].

`scroll_size`::
  (unsigned integer) The `size` parameter that is used in {es} searches.
  The default value is `1000`.

`delayed_data_check_config`::
  (object) Specifies whether the data feed checks for missing data and
  the size of the window. For example:
  `{"enabled": true, "check_window": "1h"}` See
  <<ml-datafeed-delayed-data-check-config>>.

`max_empty_searches`::
  (integer) If a real-time {dfeed} has never seen any data (including during
  any initial training period) then it will automatically stop itself and
  close its associated job after this many real-time searches that return no
  documents. In other words, it will stop after `frequency` times
  `max_empty_searches` of real-time operation. If not set
  then a {dfeed} with no end time that sees no data will remain started until
  it is explicitly stopped. By default this setting is not set.

[[ml-datafeed-chunking-config]]
==== Chunking configuration objects

{dfeeds-cap} might be required to search over long time periods, for several months
or years. This search is split into time chunks in order to ensure the load
on {es} is managed. Chunking configuration controls how the size of these time
chunks are calculated and is an advanced configuration option.

A chunking configuration object has the following properties:

`mode`::
  There are three available modes: +
  `auto`::: The chunk size will be dynamically calculated. This is the default
  and recommended value.
  `manual`::: Chunking will be applied according to the specified `time_span`.
  `off`::: No chunking will be applied.

`time_span`::
  (time units) The time span that each search will be querying.
  This setting is only applicable when the mode is set to `manual`.
  For example: `3h`.

[[ml-datafeed-delayed-data-check-config]]
==== Delayed data check configuration objects

The {dfeed} can optionally search over indices that have already been read in
an effort to determine whether any data has subsequently been added to the index.
If missing data is found, it is a good indication that the `query_delay` option
is set too low and the data is being indexed after the {dfeed} has passed that
moment in time. See
{stack-ov}/ml-delayed-data-detection.html[Working with delayed data].

This check runs only on real-time {dfeeds}.

The configuration object has the following properties:

`enabled`::
  (boolean) Specifies whether the {dfeed} periodically checks for delayed data.
  Defaults to `true`.

`check_window`::
  (time units) The window of time that is searched for late data. This window of
  time ends with the latest finalized bucket. It defaults to `null`, which
  causes an appropriate `check_window` to be calculated when the real-time
  {dfeed} runs. In particular, the default `check_window` span calculation is
  based on the maximum of `2h` or `8 * bucket_span`.

[float]
[[ml-datafeed-counts]]
==== {dfeed-cap} counts

The get {dfeed} statistics API provides information about the operational
progress of a {dfeed}. All of these properties are informational; you cannot
update their values:

`assignment_explanation`::
  (string) For started {dfeeds} only, contains messages relating to the
  selection of a node.

`datafeed_id`::
 (string) A numerical character string that uniquely identifies the {dfeed}.

`node`::
  (object) The node upon which the {dfeed} is started. The {dfeed} and job will
  be on the same node.
  `id`::: The unique identifier of the node. For example,
  "0-o0tOoRTwKFZifatTWKNw".
  `name`::: The node name. For example, `0-o0tOo`.
  `ephemeral_id`::: The node ephemeral ID.
  `transport_address`::: The host and port where transport HTTP connections are
  accepted. For example, `127.0.0.1:9300`.
  `attributes`::: For example, `{"ml.machine_memory": "17179869184"}`.

`state`::
  (string) The status of the {dfeed}, which can be one of the following values: +
  `started`::: The {dfeed} is actively receiving data.
  `stopped`::: The {dfeed} is stopped and will not receive data until it is
  re-started.

`timing_stats`::
  (object) An object that provides statistical information about timing aspect of this datafeed. +
  `job_id`::: A numerical character string that uniquely identifies the job.
  `search_count`::: Number of searches performed by this datafeed.
  `total_search_time_ms`::: Total time the datafeed spent searching in milliseconds.