diff --git a/docs/en/ml/api-quickref.asciidoc b/docs/en/ml/api-quickref.asciidoc index f241ea593fe..ff6ac658ebb 100644 --- a/docs/en/ml/api-quickref.asciidoc +++ b/docs/en/ml/api-quickref.asciidoc @@ -10,11 +10,11 @@ All {ml} endpoints have the following base: The main {ml} resources can be accessed with a variety of endpoints: -* <>: Create and manage {ml} jobs. -* <>: Update data to be analyzed. -* <>: Access the results of a {ml} job. -* <>: Manage model snapshots. -* <>: Validate subsections of job configurations. +* <>: Create and manage {ml} jobs +* <>: Select data from {es} to be analyzed +* <>: Access the results of a {ml} job +* <>: Manage model snapshots +* <>: Validate subsections of job configurations [float] [[ml-api-jobs]] diff --git a/docs/en/ml/introduction.asciidoc b/docs/en/ml/introduction.asciidoc index 47bf43a3ee2..84d3965e601 100644 --- a/docs/en/ml/introduction.asciidoc +++ b/docs/en/ml/introduction.asciidoc @@ -19,8 +19,8 @@ science-related configurations in order to get the benefits of {ml}. === Integration with the Elastic Stack Machine learning is tightly integrated with the Elastic Stack. -Data is pulled from {es} for analysis and anomaly results are displayed in -{kb} dashboards. +Data is pulled from {es} for analysis and anomaly results are displayed in {kb} +dashboards. [float] [[ml-concepts]] @@ -36,23 +36,25 @@ Jobs:: with a job, see <>. Data feeds:: - Jobs can analyze either a batch of data from a data store or a stream of data - in real-time. The latter involves data that is retrieved from {es} and is - referred to as a data feed. + Jobs can analyze either a one-off batch of data or a continuous stream of + data in real time. Data feeds retrieve data from {es} for analysis. Alternatively, you can + <> from any source directly to an API. Detectors:: Part of the configuration information associated with a job, detectors define the type of analysis that needs to be done (for example, max, average, rare). They also specify which fields to analyze. 
You can have more than one detector in a job, which is more efficient than running multiple jobs against the same - data stream. For a list of the properties associated with detectors, see + data. For a list of the properties associated with detectors, see <>. Buckets:: Part of the configuration information associated with a job, the _bucket span_ - defines the time interval across which the job analyzes. When setting the + defines the time interval used to summarize and model the data. This is typically + between 5 minutes and 1 hour, and it depends on your data characteristics. When setting the bucket span, take into account the granularity at which you want to analyze, - the frequency of the input data, and the frequency at which alerting is required. + the frequency of the input data, the typical duration of the anomalies, + and the frequency at which alerting is required. Machine learning nodes:: A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`, diff --git a/docs/en/rest-api/ml-api.asciidoc b/docs/en/rest-api/ml-api.asciidoc index 5f95a6d394d..d980a3a7024 100644 --- a/docs/en/rest-api/ml-api.asciidoc +++ b/docs/en/rest-api/ml-api.asciidoc @@ -12,14 +12,14 @@ Use machine learning to detect anomalies in time series data. 
[[ml-api-datafeed-endpoint]] === Data Feeds -* <> -* <> -* <> +* <> +* <> +* <> * <> -* <> -* <> -* <> -* <> +* <> +* <> +* <> +* <> include::ml/put-datafeed.asciidoc[] include::ml/delete-datafeed.asciidoc[] @@ -35,15 +35,15 @@ include::ml/update-datafeed.asciidoc[] You can use APIs to perform the following activities: -* <> -* <> -* <> -* <> +* <> +* <> +* <> +* <> * <> -* <> -* <> -* <> -* <> +* <> +* <> +* <> +* <> * <> * <> @@ -62,10 +62,10 @@ include::ml/validate-job.asciidoc[] [[ml-api-snapshot-endpoint]] === Model Snapshots -* <> -* <> -* <> -* <> +* <> +* <> +* <> +* <> include::ml/delete-snapshot.asciidoc[] include::ml/get-snapshot.asciidoc[] @@ -91,7 +91,7 @@ include::ml/get-record.asciidoc[] * <> * <> * <> -* <> +* <> * <> * <> diff --git a/docs/en/rest-api/ml/datafeedresource.asciidoc b/docs/en/rest-api/ml/datafeedresource.asciidoc index 0bcba261577..1f547078b5e 100644 --- a/docs/en/rest-api/ml/datafeedresource.asciidoc +++ b/docs/en/rest-api/ml/datafeedresource.asciidoc @@ -7,16 +7,18 @@ A data feed resource has the following properties: `aggregations`:: (object) If set, the data feed performs aggregation searches. For syntax information, see {ref}/search-aggregations.html[Aggregations]. - Support for aggregations is limited: TBD. + Support for aggregations is limited, and they should only be used with + low-cardinality data. For example: `{"@timestamp": {"histogram": {"field": "@timestamp", "interval": 30000,"offset": 0,"order": {"_key": "asc"},"keyed": false, "min_doc_count": 0}, "aggregations": {"events_per_min": {"sum": { "field": "events_per_min"}}}}}`. + //TBD link to a Working with aggregations page `chunking_config`:: - (object) The chunking configuration, which specifies how data searches are - chunked. See <>. + (object) Specifies how data searches are split into time chunks. + See <>. 
For example: `{"mode": "manual", "time_span": "3h"}` `datafeed_id`:: @@ -39,14 +41,12 @@ A data feed resource has the following properties: corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch. By default, this property has the following - value: `{"match_all": {"boost": 1}}`. If this property is not specified, the - default value is `“match_all”: {}`. + value: `{"match_all": {"boost": 1}}`. `query_delay`:: (time units) The number of seconds behind real-time that data is queried. For example, if data from 10:04 a.m. might not be searchable in Elasticsearch - until 10:06 a.m., set this property to 120 seconds. The default value is 60 - seconds. For example: "60s". + until 10:06 a.m., set this property to 120 seconds. The default value is `60s`. `scroll_size`:: (unsigned integer) The `size` parameter that is used in Elasticsearch searches. @@ -59,11 +59,17 @@ A data feed resource has the following properties: [[ml-datafeed-chunking-config]] ===== Chunking Configuration Objects +Data feeds may be required to search over long time periods, such as several months +or years. This search is split into time chunks in order to manage the load +on {es}. Chunking configuration controls how the size of these time +chunks is calculated and is an advanced configuration option. + A chunking configuration object has the following properties: `mode` (required):: There are three available modes: + - `auto`::: The chunk size will be dynamically calculated. + `auto`::: The chunk size will be dynamically calculated. This is the default + and recommended value. `manual`::: Chunking will be applied according to the specified `time_span`. `off`::: No chunking will be applied. 
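The three chunking modes described in this hunk can be sketched in a few lines of Python. This is an editorial illustration, not part of the patched docs: the `chunking_config` helper is hypothetical, but the field names (`mode`, `time_span`) match the property list above.

```python
import json

def chunking_config(mode, time_span=None):
    """Build a datafeed chunking_config object (illustrative sketch;
    this helper is not part of the {ml} APIs)."""
    if mode not in ("auto", "manual", "off"):
        raise ValueError("mode must be 'auto', 'manual', or 'off'")
    config = {"mode": mode}
    if mode == "manual":
        # Only 'manual' mode takes a time_span, e.g. "3h"
        if time_span is None:
            raise ValueError("'manual' mode requires a time_span")
        config["time_span"] = time_span
    return config

# Mirrors the example in the text: {"mode": "manual", "time_span": "3h"}
print(json.dumps(chunking_config("manual", "3h")))
```

Under `auto` mode (the default), no `time_span` is sent and the server calculates chunk sizes itself, which is why the docs recommend it.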
@@ -79,20 +85,20 @@ A chunking configuration object has the following properties: The get data feed statistics API provides information about the operational progress of a data feed. For example: -`assigment_explanation`:: - TBD. For example: " " +`assignment_explanation`:: + (string) For started data feeds only, contains messages relating to the selection + of a node. `datafeed_id`:: (string) A numerical character string that uniquely identifies the data feed. `node`:: - (object) TBD - The node that is running the query? - `id`::: TBD. For example, "0-o0tOoRTwKFZifatTWKNw". - `name`::: TBD. For example, "0-o0tOo". - `ephemeral_id`::: TBD. For example, "DOZltLxLS_SzYpW6hQ9hyg". - `transport_address`::: TBD. For example, "127.0.0.1:9300". - `attributes`::: TBD. For example, {"max_running_jobs": "10"}. + (object) The node upon which the data feed is started. The data feed and job will be on the same node. + `id`::: The unique identifier of the node. For example, "0-o0tOoRTwKFZifatTWKNw". + `name`::: The node name. For example, "0-o0tOo". + `ephemeral_id`::: The ephemeral id of the node. + `transport_address`::: The host and port where transport HTTP connections are accepted. For example, "127.0.0.1:9300". + `attributes`::: The node attributes. For example, {"max_running_jobs": "10"}. `state`:: (string) The status of the data feed, which can be one of the following values: + diff --git a/docs/en/rest-api/ml/jobcounts.asciidoc b/docs/en/rest-api/ml/jobcounts.asciidoc index daa8f03dd9d..47435dd1a9b 100644 --- a/docs/en/rest-api/ml/jobcounts.asciidoc +++ b/docs/en/rest-api/ml/jobcounts.asciidoc @@ -118,14 +118,8 @@ necessarily a cause for concern. This value includes records with missing fields, since they are nonetheless analyzed. + If you use data feeds and have aggregations in your search query, - the `processed_record_count` differs from the `input_record_count`. 
+ - If you use the <> to provide data to the job, - the following records are not processed: + -+ --- -* Records not in chronological order and outside the latency window -* Records with invalid timestamp --- + the `processed_record_count` will be the number of aggregated records + processed, not the number of {es} documents. `sparse_bucket_count`:: (long) The number of buckets that contained few data points compared to the @@ -167,12 +161,12 @@ The `model_size_stats` object has the following properties: (string) For internal use. The type of result. `total_by_field_count`:: - (long) The number of `by` field values that were analyzed by the models. + (long) The number of `by` field values that were analyzed by the models. + NOTE: The `by` field values are counted separately for each detector and partition. `total_over_field_count`:: - (long) The number of `over` field values that were analyzed by the models. + (long) The number of `over` field values that were analyzed by the models. + NOTE: The `over` field values are counted separately for each detector and partition. @@ -196,12 +190,10 @@ This information is available only for open jobs. (string) The node name. `ephemeral_id`:: - + (string) The ephemeral id of the node. `transport_address`:: (string) The host and port where transport HTTP connections are accepted. `attributes`:: - (object) {ml} attributes. - `max_running_jobs`::: The maximum number of concurrently open jobs that are - allowed per node. + (object) The node attributes. For example, {"max_running_jobs": "10"}. diff --git a/docs/en/rest-api/ml/post-data.asciidoc b/docs/en/rest-api/ml/post-data.asciidoc index 7c9c2b35450..e61f695cd77 100644 --- a/docs/en/rest-api/ml/post-data.asciidoc +++ b/docs/en/rest-api/ml/post-data.asciidoc @@ -15,9 +15,17 @@ The job must have been opened prior to sending data. File sizes are limited to 100 Mb, so if your file is larger, then split it into multiple files and upload each one separately in sequential time order. 
-When running in real-time, it is generally recommended to arrange to perform +When running in real-time, it is generally recommended to perform many small uploads, rather than queueing data to upload larger files. +When uploading data, check the <> for progress. +The following records will not be processed: + +* Records not in chronological order and outside the latency window +* Records with an invalid timestamp + +//TBD link to Working with Out of Order timeseries concept doc + IMPORTANT: Data can only be accepted from a single connection. Use a single connection synchronously to send data, close, flush, or delete a single job. It is not currently possible to post data to multiple jobs using wildcards diff --git a/docs/en/rest-api/ml/snapshotresource.asciidoc b/docs/en/rest-api/ml/snapshotresource.asciidoc index e9981dbb866..f14fd25c069 100644 --- a/docs/en/rest-api/ml/snapshotresource.asciidoc +++ b/docs/en/rest-api/ml/snapshotresource.asciidoc @@ -14,7 +14,6 @@ When choosing a new value, consider the following: * Persistence enables snapshots to be reverted. * The time taken to persist a job is proportional to the size of the model in memory. //* The smallest allowed value is 3600 (1 hour). -//// A model snapshot resource has the following properties: @@ -34,7 +33,8 @@ A model snapshot resource has the following properties: (object) Summary information describing the model. See <>. `retain`:: - (boolean) If true, this snapshot will not be deleted during automatic cleanup of snapshots older than `model_snapshot_retention_days`. + (boolean) If true, this snapshot will not be deleted during automatic cleanup of snapshots + older than `model_snapshot_retention_days`. However, this snapshot will be deleted when the job is deleted. The default value is false. @@ -89,4 +89,4 @@ The `model_size_stats` object has the following properties: `total_partition_field_count`:: (long) The number of _partition_ field values analyzed. 
-//// + diff --git a/docs/en/settings/ml-settings.asciidoc b/docs/en/settings/ml-settings.asciidoc index f1310f612f0..a55722120e0 100644 --- a/docs/en/settings/ml-settings.asciidoc +++ b/docs/en/settings/ml-settings.asciidoc @@ -1,6 +1,6 @@ [[ml-settings]] == Machine Learning Settings -You do not need to configure any settings to use {ml}. +You do not need to configure any settings to use {ml}. It is enabled by default. [float] [[general-ml-settings]]
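The post data hunk above states that records out of chronological order and outside the latency window, or with an invalid timestamp, are not processed. A small Python sketch makes that rule concrete; this is an editorial illustration with hypothetical names (`unprocessable_records`, epoch-second timestamps), not the actual server-side implementation.

```python
def unprocessable_records(records, latency_seconds):
    """Return records that would be skipped under the rule in the
    post data API docs: a record is skipped if its timestamp is
    invalid, or if it is older than the latest timestamp seen minus
    the latency window. Illustrative sketch only."""
    latest = None
    skipped = []
    for record in records:
        ts = record.get("timestamp")
        if not isinstance(ts, (int, float)):
            skipped.append(record)  # invalid timestamp
            continue
        if latest is not None and ts < latest - latency_seconds:
            skipped.append(record)  # out of order, outside the latency window
            continue
        latest = ts if latest is None else max(latest, ts)
    return skipped

records = [
    {"timestamp": 600},
    {"timestamp": 660},
    {"timestamp": 540},    # 120s behind the latest record seen (660)
    {"timestamp": "n/a"},  # invalid timestamp
]
print(len(unprocessable_records(records, latency_seconds=60)))  # prints 2
```

A record that is merely out of order but still within the latency window is accepted, which is why the docs recommend uploading files in sequential time order.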