* [DOCS] Overall review

* [DOCS] General review

* [DOCS] typo

* [DOCS] Fix for processed_record_count with aggs

* [DOCS] Added latency tbd

Original commit: elastic/x-pack-elasticsearch@9e8cf664c1
Sophie Chang 2017-04-27 18:51:48 +01:00 committed by lcawley
parent 642b1f7c19
commit ffb3bb6493
8 changed files with 77 additions and 69 deletions

View File

@@ -10,11 +10,11 @@ All {ml} endpoints have the following base:
 The main {ml} resources can be accessed with a variety of endpoints:
-* <<ml-api-jobs,+/anomaly_detectors/+>>: Create and manage {ml} jobs.
-* <<ml-api-datafeeds,+/datafeeds/+>>: Update data to be analyzed.
-* <<ml-api-results,+/results/+>>: Access the results of a {ml} job.
-* <<ml-api-snapshots,+/model_snapshots/+>>: Manage model snapshots.
-* <<ml-api-validate,+/validate/+>>: Validate subsections of job configurations.
+* <<ml-api-jobs,+/anomaly_detectors/+>>: Create and manage {ml} jobs
+* <<ml-api-datafeeds,+/datafeeds/+>>: Select data from {es} to be analyzed
+* <<ml-api-results,+/results/+>>: Access the results of a {ml} job
+* <<ml-api-snapshots,+/model_snapshots/+>>: Manage model snapshots
+* <<ml-api-validate,+/validate/+>>: Validate subsections of job configurations
 [float]
 [[ml-api-jobs]]

View File

@@ -19,8 +19,8 @@ science-related configurations in order to get the benefits of {ml}.
 === Integration with the Elastic Stack
 Machine learning is tightly integrated with the Elastic Stack.
-Data is pulled from {es} for analysis and anomaly results are displayed in
-{kb} dashboards.
+Data is pulled from {es} for analysis and anomaly results are displayed in {kb}
+dashboards.
 [float]
 [[ml-concepts]]
@@ -36,23 +36,25 @@ Jobs::
 with a job, see <<ml-job-resource, Job Resources>>.
 Data feeds::
-Jobs can analyze either a batch of data from a data store or a stream of data
-in real-time. The latter involves data that is retrieved from {es} and is
-referred to as a data feed.
+Jobs can analyze either a one-off batch of data or continuously in real-time.
+Data feeds retrieve data from {es} for analysis. Alternatively, you can
+<<ml-post-data,POST data>> from any source directly to an API.
 Detectors::
 Part of the configuration information associated with a job, detectors define
 the type of analysis that needs to be done (for example, max, average, rare).
 They also specify which fields to analyze. You can have more than one detector
 in a job, which is more efficient than running multiple jobs against the same
-data stream. For a list of the properties associated with detectors, see
+data. For a list of the properties associated with detectors, see
 <<ml-detectorconfig, Detector Configuration Objects>>.
 Buckets::
 Part of the configuration information associated with a job, the _bucket span_
-defines the time interval across which the job analyzes. When setting the
+defines the time interval used to summarize and model the data. This is typically
+between 5 minutes and 1 hour, and it depends on your data characteristics. When setting the
 bucket span, take into account the granularity at which you want to analyze,
-the frequency of the input data, and the frequency at which alerting is required.
+the frequency of the input data, the typical duration of the anomalies,
+and the frequency at which alerting is required.
 Machine learning nodes::
 A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`,
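As a minimal sketch of those two node settings, an `elasticsearch.yml` entry for a {ml}-capable node could look like this (an illustration only; both settings default to `true` when {xpack} is installed):

[source,yaml]
--------------------------------------------------
# Allow this node to run machine learning jobs (both settings default to true)
xpack.ml.enabled: true
node.ml: true
--------------------------------------------------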

View File

@@ -12,14 +12,14 @@ Use machine learning to detect anomalies in time series data.
 [[ml-api-datafeed-endpoint]]
 === Data Feeds
-* <<ml-put-datafeed,Create data feeds>>
-* <<ml-delete-datafeed,Delete data feeds>>
-* <<ml-get-datafeed,Get data feeds>>
+* <<ml-put-datafeed,Create data feed>>
+* <<ml-delete-datafeed,Delete data feed>>
+* <<ml-get-datafeed,Get data feed info>>
 * <<ml-get-datafeed-stats,Get data feed statistics>>
-* <<ml-preview-datafeed,Preview data feeds>>
-* <<ml-start-datafeed,Start data feeds>>
-* <<ml-stop-datafeed,Stop data feeds>>
-* <<ml-update-datafeed,Update data feeds>>
+* <<ml-preview-datafeed,Preview data feed>>
+* <<ml-start-datafeed,Start data feed>>
+* <<ml-stop-datafeed,Stop data feed>>
+* <<ml-update-datafeed,Update data feed>>
 include::ml/put-datafeed.asciidoc[]
 include::ml/delete-datafeed.asciidoc[]
@@ -35,15 +35,15 @@ include::ml/update-datafeed.asciidoc[]
 You can use APIs to perform the following activities:
-* <<ml-close-job,Close jobs>>
-* <<ml-put-job,Create jobs>>
-* <<ml-delete-job,Delete jobs>>
-* <<ml-get-job,Get jobs>>
+* <<ml-close-job,Close job>>
+* <<ml-put-job,Create job>>
+* <<ml-delete-job,Delete job>>
+* <<ml-get-job,Get job info>>
 * <<ml-get-job-stats,Get job statistics>>
-* <<ml-flush-job,Flush jobs>>
-* <<ml-open-job,Open jobs>>
-* <<ml-post-data,Post data to jobs>>
-* <<ml-update-job,Update jobs>>
+* <<ml-flush-job,Flush job>>
+* <<ml-open-job,Open job>>
+* <<ml-post-data,Post data to job>>
+* <<ml-update-job,Update job>>
 * <<ml-valid-detector,Validate detectors>>
 * <<ml-valid-job,Validate job>>
@@ -62,10 +62,10 @@ include::ml/validate-job.asciidoc[]
 [[ml-api-snapshot-endpoint]]
 === Model Snapshots
-* <<ml-delete-snapshot,Delete model snapshots>>
-* <<ml-get-snapshot,Get model snapshots>>
-* <<ml-revert-snapshot,Revert model snapshots>>
-* <<ml-update-snapshot,Update model snapshots>>
+* <<ml-delete-snapshot,Delete model snapshot>>
+* <<ml-get-snapshot,Get model snapshot info>>
+* <<ml-revert-snapshot,Revert model snapshot>>
+* <<ml-update-snapshot,Update model snapshot>>
 include::ml/delete-snapshot.asciidoc[]
 include::ml/get-snapshot.asciidoc[]
@@ -91,7 +91,7 @@ include::ml/get-record.asciidoc[]
 * <<ml-datafeed-resource,Data feeds>>
 * <<ml-datafeed-counts,Data feed counts>>
 * <<ml-job-resource,Jobs>>
-* <<ml-jobstats,Job Stats>>
+* <<ml-jobstats,Job statistics>>
 * <<ml-snapshot-resource,Model snapshots>>
 * <<ml-results-resource,Results>>

View File

@@ -7,16 +7,18 @@ A data feed resource has the following properties:
 `aggregations`::
 (object) If set, the data feed performs aggregation searches.
 For syntax information, see {ref}/search-aggregations.html[Aggregations].
-Support for aggregations is limited: TBD.
+Support for aggregations is limited and should only be used with
+low cardinality data:
 For example:
 `{"@timestamp": {"histogram": {"field": "@timestamp",
 "interval": 30000,"offset": 0,"order": {"_key": "asc"},"keyed": false,
 "min_doc_count": 0}, "aggregations": {"events_per_min": {"sum": {
 "field": "events_per_min"}}}}}`.
+//TBD link to a Working with aggregations page
 `chunking_config`::
-(object) The chunking configuration, which specifies how data searches are
-chunked. See <<ml-datafeed-chunking-config>>.
+(object) Specifies how data searches are split into time chunks.
+See <<ml-datafeed-chunking-config>>.
 For example: {"mode": "manual", "time_span": "3h"}
 `datafeed_id`::
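For readability, the inline aggregation example quoted in the hunk above corresponds to a datafeed `aggregations` object roughly like the following (a sketch only; the `@timestamp` histogram and the `events_per_min` sum field are the illustrative names used in the text):

[source,js]
--------------------------------------------------
{
  "aggregations": {
    "@timestamp": {
      "histogram": {
        "field": "@timestamp",
        "interval": 30000,
        "offset": 0,
        "order": { "_key": "asc" },
        "keyed": false,
        "min_doc_count": 0
      },
      "aggregations": {
        "events_per_min": {
          "sum": { "field": "events_per_min" }
        }
      }
    }
  }
}
--------------------------------------------------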
@@ -39,14 +41,12 @@ A data feed resource has the following properties:
 corresponds to the query object in an Elasticsearch search POST body. All the
 options that are supported by Elasticsearch can be used, as this object is
 passed verbatim to Elasticsearch. By default, this property has the following
-value: `{"match_all": {"boost": 1}}`. If this property is not specified, the
-default value is `“match_all”: {}`.
+value: `{"match_all": {"boost": 1}}`.
 `query_delay`::
 (time units) The number of seconds behind real-time that data is queried. For
 example, if data from 10:04 a.m. might not be searchable in Elasticsearch
-until 10:06 a.m., set this property to 120 seconds. The default value is 60
-seconds. For example: "60s".
+until 10:06 a.m., set this property to 120 seconds. The default value is `60s`.
 `scroll_size`::
 (unsigned integer) The `size` parameter that is used in Elasticsearch searches.
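Put together, the `query`, `query_delay`, and `scroll_size` properties described above might appear in a datafeed resource along these lines (a minimal sketch showing only these three properties, with the default `match_all` query and an illustrative scroll size):

[source,js]
--------------------------------------------------
{
  "query": { "match_all": { "boost": 1 } },
  "query_delay": "60s",
  "scroll_size": 1000
}
--------------------------------------------------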
@@ -59,11 +59,17 @@ A data feed resource has the following properties:
 [[ml-datafeed-chunking-config]]
 ===== Chunking Configuration Objects
+Data feeds may be required to search over long time periods, for several months
+or years. This search is split into time chunks in order to ensure the load
+on {es} is managed. Chunking configuration controls how the size of these time
+chunks is calculated and is an advanced configuration option.
 A chunking configuration object has the following properties:
 `mode` (required)::
 There are three available modes: +
-`auto`::: The chunk size will be dynamically calculated.
+`auto`::: The chunk size will be dynamically calculated. This is the default
+and recommended value.
 `manual`::: Chunking will be applied according to the specified `time_span`.
 `off`::: No chunking will be applied.
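To make the modes above concrete, the `{"mode": "manual", "time_span": "3h"}` example mentioned earlier in the page corresponds to a datafeed fragment like the following sketch, which forces each search to cover a three-hour window:

[source,js]
--------------------------------------------------
{
  "chunking_config": {
    "mode": "manual",
    "time_span": "3h"
  }
}
--------------------------------------------------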
@@ -79,20 +85,20 @@ A chunking configuration object has the following properties:
 The get data feed statistics API provides information about the operational
 progress of a data feed. For example:
-`assigment_explanation`::
-TBD. For example: " "
+`assignment_explanation`::
+(string) For started data feeds only, contains messages relating to the selection
+of a node.
 `datafeed_id`::
 (string) A numerical character string that uniquely identifies the data feed.
 `node`::
-(object) TBD
-The node that is running the query?
-`id`::: TBD. For example, "0-o0tOoRTwKFZifatTWKNw".
-`name`::: TBD. For example, "0-o0tOo".
-`ephemeral_id`::: TBD. For example, "DOZltLxLS_SzYpW6hQ9hyg".
-`transport_address`::: TBD. For example, "127.0.0.1:9300".
-`attributes`::: TBD. For example, {"max_running_jobs": "10"}.
+(object) The node upon which the data feed is started. The data feed and job will be on the same node.
+`id`::: The unique identifier of the node. For example, "0-o0tOoRTwKFZifatTWKNw".
+`name`::: The node name. For example, "0-o0tOo".
+`ephemeral_id`::: The node ephemeral id.
+`transport_address`::: The host and port where transport HTTP connections are accepted. For example, "127.0.0.1:9300".
+`attributes`::: For example, {"max_running_jobs": "10"}.
 `state`::
 (string) The status of the data feed, which can be one of the following values: +

View File

@@ -118,14 +118,8 @@ necessarily a cause for concern.
 This value includes records with missing fields, since they are nonetheless
 analyzed. +
 If you use data feeds and have aggregations in your search query,
-the `processed_record_count` differs from the `input_record_count`. +
-If you use the <<ml-post-data,post data API>> to provide data to the job,
-the following records are not processed: +
-+
---
-* Records not in chronological order and outside the latency window
-* Records with invalid timestamp
---
+the `processed_record_count` will be the number of aggregated records
+processed, not the number of {es} documents.
 `sparse_bucket_count`::
 (long) The number of buckets that contained few data points compared to the
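As a hypothetical illustration of the new wording above: if a datafeed summarizes one hour of data containing 100,000 documents into 30-second histogram buckets, `processed_record_count` increases by roughly 120 (one per aggregated bucket), not by 100,000.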
@@ -167,12 +161,12 @@ The `model_size_stats` object has the following properties:
 (string) For internal use. The type of result.
 `total_by_field_count`::
-(long) The number of `by` field values that were analyzed by the models.
+(long) The number of `by` field values that were analyzed by the models.+
 NOTE: The `by` field values are counted separately for each detector and partition.
 `total_over_field_count`::
-(long) The number of `over` field values that were analyzed by the models.
+(long) The number of `over` field values that were analyzed by the models.+
 NOTE: The `over` field values are counted separately for each detector and partition.
@@ -196,12 +190,10 @@ This information is available only for open jobs.
 (string) The node name.
 `ephemeral_id`::
 (string) The ephemeral id of the node.
 `transport_address`::
 (string) The host and port where transport HTTP connections are accepted.
 `attributes`::
-(object) {ml} attributes.
-`max_running_jobs`::: The maximum number of concurrently open jobs that are
-allowed per node.
+(object) For example, {"max_running_jobs": "10"}.

View File

@@ -15,9 +15,17 @@ The job must have been opened prior to sending data.
 File sizes are limited to 100 Mb, so if your file is larger,
 then split it into multiple files and upload each one separately in sequential time order.
-When running in real-time, it is generally recommended to arrange to perform
+When running in real-time, it is generally recommended to perform
 many small uploads, rather than queueing data to upload larger files.
+When uploading data, check the <<ml-datacounts,job data counts>> for progress.
+The following records will not be processed:
+* Records not in chronological order and outside the latency window
+* Records with an invalid timestamp
+//TBD link to Working with Out of Order timeseries concept doc
 IMPORTANT: Data can only be accepted from a single connection.
 Use a single connection synchronously to send data, close, flush, or delete a single job.
 It is not currently possible to post data to multiple jobs using wildcards
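For example, streaming a couple of newline-delimited JSON records into an open job could look roughly like this sketch (the job name and the `time`/`events_per_min` fields are illustrative, and the endpoint prefix assumes the 5.x `_xpack/ml` API):

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/example-job/_data
{"time": "2017-04-27T10:00:00Z", "events_per_min": 42}
{"time": "2017-04-27T10:01:00Z", "events_per_min": 47}
--------------------------------------------------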

View File

@@ -14,7 +14,6 @@ When choosing a new value, consider the following:
 * Persistence enables snapshots to be reverted.
 * The time taken to persist a job is proportional to the size of the model in memory.
 //* The smallest allowed value is 3600 (1 hour).
-////
 A model snapshot resource has the following properties:
@@ -34,7 +33,8 @@ A model snapshot resource has the following properties:
 (object) Summary information describing the model. See <<ml-snapshot-stats,Model Size Statistics>>.
 `retain`::
-(boolean) If true, this snapshot will not be deleted during automatic cleanup of snapshots older than `model_snapshot_retention_days`.
+(boolean) If true, this snapshot will not be deleted during automatic cleanup of snapshots
+older than `model_snapshot_retention_days`.
 However, this snapshot will be deleted when the job is deleted.
 The default value is false.
@@ -89,4 +89,4 @@ The `model_size_stats` object has the following properties:
 `total_partition_field_count`::
 (long) The number of _partition_ field values analyzed.
-////

View File

@@ -1,6 +1,6 @@
 [[ml-settings]]
 == Machine Learning Settings
-You do not need to configure any settings to use {ml}.
+You do not need to configure any settings to use {ml}. It is enabled by default.
 [float]
 [[general-ml-settings]]
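Conversely, if you do want to turn {ml} off on a node, a minimal `elasticsearch.yml` entry would look like the following sketch (the setting name comes from the overview section earlier in this commit; `false` simply overrides the enabled-by-default behavior):

[source,yaml]
--------------------------------------------------
# Disable machine learning on this node
xpack.ml.enabled: false
--------------------------------------------------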