* [DOCS] Overall review

* [DOCS] General review

* [DOCS] typo

* [DOCS] Fix for processed_record_count with aggs

* [DOCS] Added latency tbd

Original commit: elastic/x-pack-elasticsearch@9e8cf664c1
This commit is contained in:
Sophie Chang 2017-04-27 18:51:48 +01:00 committed by lcawley
parent 642b1f7c19
commit ffb3bb6493
8 changed files with 77 additions and 69 deletions

View File

@ -10,11 +10,11 @@ All {ml} endpoints have the following base:
The main {ml} resources can be accessed with a variety of endpoints:
* <<ml-api-jobs,+/anomaly_detectors/+>>: Create and manage {ml} jobs.
* <<ml-api-datafeeds,+/datafeeds/+>>: Update data to be analyzed.
* <<ml-api-results,+/results/+>>: Access the results of a {ml} job.
* <<ml-api-snapshots,+/model_snapshots/+>>: Manage model snapshots.
* <<ml-api-validate,+/validate/+>>: Validate subsections of job configurations.
* <<ml-api-jobs,+/anomaly_detectors/+>>: Create and manage {ml} jobs
* <<ml-api-datafeeds,+/datafeeds/+>>: Select data from {es} to be analyzed
* <<ml-api-results,+/results/+>>: Access the results of a {ml} job
* <<ml-api-snapshots,+/model_snapshots/+>>: Manage model snapshots
* <<ml-api-validate,+/validate/+>>: Validate subsections of job configurations
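For instance, assuming the `_xpack/ml` base path used by the 5.x APIs, you could list all
configured jobs with a request like this (a sketch, not an exhaustive reference):

[source,js]
--------------------------------------------------
GET _xpack/ml/anomaly_detectors
--------------------------------------------------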
[float]
[[ml-api-jobs]]

View File

@ -19,8 +19,8 @@ science-related configurations in order to get the benefits of {ml}.
=== Integration with the Elastic Stack
Machine learning is tightly integrated with the Elastic Stack.
Data is pulled from {es} for analysis and anomaly results are displayed in
{kb} dashboards.
Data is pulled from {es} for analysis and anomaly results are displayed in {kb}
dashboards.
[float]
[[ml-concepts]]
@ -36,23 +36,25 @@ Jobs::
with a job, see <<ml-job-resource, Job Resources>>.
Data feeds::
Jobs can analyze either a batch of data from a data store or a stream of data
in real-time. The latter involves data that is retrieved from {es} and is
referred to as a data feed.
Jobs can analyze either a one-off batch of data or a continuous stream of data in real time.
Data feeds retrieve data from {es} for analysis. Alternatively, you can
<<ml-post-data,POST data>> from any source directly to an API.
Detectors::
Part of the configuration information associated with a job, detectors define
the type of analysis that needs to be done (for example, max, average, rare).
They also specify which fields to analyze. You can have more than one detector
in a job, which is more efficient than running multiple jobs against the same
data stream. For a list of the properties associated with detectors, see
data. For a list of the properties associated with detectors, see
<<ml-detectorconfig, Detector Configuration Objects>>.
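As a minimal sketch, a single detector that finds unusually high response times for each
airline might look like this (the `responsetime` and `airline` field names are illustrative):

[source,js]
--------------------------------------------------
{
  "detector_description": "Max responsetime by airline",
  "function": "max",
  "field_name": "responsetime",
  "by_field_name": "airline"
}
--------------------------------------------------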
Buckets::
Part of the configuration information associated with a job, the _bucket span_
defines the time interval across which the job analyzes. When setting the
defines the time interval used to summarize and model the data. This is typically
between 5 minutes and 1 hour, depending on the characteristics of your data. When setting the
bucket span, take into account the granularity at which you want to analyze,
the frequency of the input data, and the frequency at which alerting is required.
the frequency of the input data, the typical duration of the anomalies,
and the frequency at which alerting is required.
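For example, here is a sketch of an `analysis_config` that models event counts in
15-minute buckets (assuming the time-unit syntax for `bucket_span`):

[source,js]
--------------------------------------------------
"analysis_config": {
  "bucket_span": "15m",
  "detectors": [
    { "function": "count" }
  ]
}
--------------------------------------------------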
Machine learning nodes::
A {ml} node is a node that has `xpack.ml.enabled` and `node.ml` set to `true`,

View File

@ -12,14 +12,14 @@ Use machine learning to detect anomalies in time series data.
[[ml-api-datafeed-endpoint]]
=== Data Feeds
* <<ml-put-datafeed,Create data feeds>>
* <<ml-delete-datafeed,Delete data feeds>>
* <<ml-get-datafeed,Get data feeds>>
* <<ml-put-datafeed,Create data feed>>
* <<ml-delete-datafeed,Delete data feed>>
* <<ml-get-datafeed,Get data feed info>>
* <<ml-get-datafeed-stats,Get data feed statistics>>
* <<ml-preview-datafeed,Preview data feeds>>
* <<ml-start-datafeed,Start data feeds>>
* <<ml-stop-datafeed,Stop data feeds>>
* <<ml-update-datafeed,Update data feeds>>
* <<ml-preview-datafeed,Preview data feed>>
* <<ml-start-datafeed,Start data feed>>
* <<ml-stop-datafeed,Stop data feed>>
* <<ml-update-datafeed,Update data feed>>
include::ml/put-datafeed.asciidoc[]
include::ml/delete-datafeed.asciidoc[]
@ -35,15 +35,15 @@ include::ml/update-datafeed.asciidoc[]
You can use APIs to perform the following activities:
* <<ml-close-job,Close jobs>>
* <<ml-put-job,Create jobs>>
* <<ml-delete-job,Delete jobs>>
* <<ml-get-job,Get jobs>>
* <<ml-close-job,Close job>>
* <<ml-put-job,Create job>>
* <<ml-delete-job,Delete job>>
* <<ml-get-job,Get job info>>
* <<ml-get-job-stats,Get job statistics>>
* <<ml-flush-job,Flush jobs>>
* <<ml-open-job,Open jobs>>
* <<ml-post-data,Post data to jobs>>
* <<ml-update-job,Update jobs>>
* <<ml-flush-job,Flush job>>
* <<ml-open-job,Open job>>
* <<ml-post-data,Post data to job>>
* <<ml-update-job,Update job>>
* <<ml-valid-detector,Validate detectors>>
* <<ml-valid-job,Validate job>>
@ -62,10 +62,10 @@ include::ml/validate-job.asciidoc[]
[[ml-api-snapshot-endpoint]]
=== Model Snapshots
* <<ml-delete-snapshot,Delete model snapshots>>
* <<ml-get-snapshot,Get model snapshots>>
* <<ml-revert-snapshot,Revert model snapshots>>
* <<ml-update-snapshot,Update model snapshots>>
* <<ml-delete-snapshot,Delete model snapshot>>
* <<ml-get-snapshot,Get model snapshot info>>
* <<ml-revert-snapshot,Revert model snapshot>>
* <<ml-update-snapshot,Update model snapshot>>
include::ml/delete-snapshot.asciidoc[]
include::ml/get-snapshot.asciidoc[]
@ -91,7 +91,7 @@ include::ml/get-record.asciidoc[]
* <<ml-datafeed-resource,Data feeds>>
* <<ml-datafeed-counts,Data feed counts>>
* <<ml-job-resource,Jobs>>
* <<ml-jobstats,Job Stats>>
* <<ml-jobstats,Job statistics>>
* <<ml-snapshot-resource,Model snapshots>>
* <<ml-results-resource,Results>>

View File

@ -7,16 +7,18 @@ A data feed resource has the following properties:
`aggregations`::
(object) If set, the data feed performs aggregation searches.
For syntax information, see {ref}/search-aggregations.html[Aggregations].
Support for aggregations is limited: TBD.
Support for aggregations is limited and should only be used with
low-cardinality data.
For example:
`{"@timestamp": {"histogram": {"field": "@timestamp",
"interval": 30000,"offset": 0,"order": {"_key": "asc"},"keyed": false,
"min_doc_count": 0}, "aggregations": {"events_per_min": {"sum": {
"field": "events_per_min"}}}}}`.
//TBD link to a Working with aggregations page
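Expanded for readability, the same aggregation could appear in a datafeed configuration as
follows (the `events_per_min` field is taken from the example above):

[source,js]
--------------------------------------------------
"aggregations": {
  "@timestamp": {
    "histogram": {
      "field": "@timestamp",
      "interval": 30000,
      "offset": 0,
      "order": { "_key": "asc" },
      "keyed": false,
      "min_doc_count": 0
    },
    "aggregations": {
      "events_per_min": {
        "sum": { "field": "events_per_min" }
      }
    }
  }
}
--------------------------------------------------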
`chunking_config`::
(object) The chunking configuration, which specifies how data searches are
chunked. See <<ml-datafeed-chunking-config>>.
(object) Specifies how data searches are split into time chunks.
See <<ml-datafeed-chunking-config>>.
For example: {"mode": "manual", "time_span": "3h"}
`datafeed_id`::
@ -39,14 +41,12 @@ A data feed resource has the following properties:
corresponds to the query object in an Elasticsearch search POST body. All the
options that are supported by Elasticsearch can be used, as this object is
passed verbatim to Elasticsearch. By default, this property has the following
value: `{"match_all": {"boost": 1}}`. If this property is not specified, the
default value is `“match_all”: {}`.
value: `{"match_all": {"boost": 1}}`.
`query_delay`::
(time units) The number of seconds behind real-time that data is queried. For
example, if data from 10:04 a.m. might not be searchable in Elasticsearch
until 10:06 a.m., set this property to 120 seconds. The default value is 60
seconds. For example: "60s".
until 10:06 a.m., set this property to 120 seconds. The default value is `60s`.
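For instance, to allow for the two-minute ingestion lag described above, a datafeed might
include the following property (a sketch):

[source,js]
--------------------------------------------------
"query_delay": "120s"
--------------------------------------------------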
`scroll_size`::
(unsigned integer) The `size` parameter that is used in Elasticsearch searches.
@ -59,11 +59,17 @@ A data feed resource has the following properties:
[[ml-datafeed-chunking-config]]
===== Chunking Configuration Objects
Data feeds may be required to search over long time periods, spanning several months
or years. This search is split into time chunks in order to manage the load
on {es}. Chunking configuration controls how the size of these time
chunks is calculated and is an advanced configuration option.
A chunking configuration object has the following properties:
`mode` (required)::
There are three available modes: +
`auto`::: The chunk size will be dynamically calculated.
`auto`::: The chunk size will be dynamically calculated. This is the default
and recommended value.
`manual`::: Chunking will be applied according to the specified `time_span`.
`off`::: No chunking will be applied.
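For example, a manual configuration that searches in three-hour chunks might look like this
(a sketch of the object described above):

[source,js]
--------------------------------------------------
"chunking_config": {
  "mode": "manual",
  "time_span": "3h"
}
--------------------------------------------------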
@ -79,20 +85,20 @@ A chunking configuration object has the following properties:
The get data feed statistics API provides information about the operational
progress of a data feed. For example:
`assigment_explanation`::
TBD. For example: " "
`assignment_explanation`::
(string) For started data feeds only, contains messages relating to the selection
of a node.
`datafeed_id`::
(string) A numerical character string that uniquely identifies the data feed.
`node`::
(object) TBD
The node that is running the query?
`id`::: TBD. For example, "0-o0tOoRTwKFZifatTWKNw".
`name`::: TBD. For example, "0-o0tOo".
`ephemeral_id`::: TBD. For example, "DOZltLxLS_SzYpW6hQ9hyg".
`transport_address`::: TBD. For example, "127.0.0.1:9300".
`attributes`::: TBD. For example, {"max_running_jobs": "10"}.
(object) The node upon which the data feed is started. The data feed and job will be on the same node.
`id`::: The unique identifier of the node. For example, "0-o0tOoRTwKFZifatTWKNw".
`name`::: The node name. For example, "0-o0tOo".
`ephemeral_id`::: The node ephemeral id.
`transport_address`::: The host and port where transport HTTP connections are accepted. For example, "127.0.0.1:9300".
`attributes`::: The {ml} attributes of the node. For example, `{"max_running_jobs": "10"}`.
`state`::
(string) The status of the data feed, which can be one of the following values: +

View File

@ -118,14 +118,8 @@ necessarily a cause for concern.
This value includes records with missing fields, since they are nonetheless
analyzed. +
If you use data feeds and have aggregations in your search query,
the `processed_record_count` differs from the `input_record_count`. +
If you use the <<ml-post-data,post data API>> to provide data to the job,
the following records are not processed: +
+
--
* Records not in chronological order and outside the latency window
* Records with invalid timestamp
--
the `processed_record_count` will be the number of aggregated records
processed, not the number of {es} documents.
`sparse_bucket_count`::
(long) The number of buckets that contained few data points compared to the
@ -167,12 +161,12 @@ The `model_size_stats` object has the following properties:
(string) For internal use. The type of result.
`total_by_field_count`::
(long) The number of `by` field values that were analyzed by the models.
(long) The number of `by` field values that were analyzed by the models.
+
NOTE: The `by` field values are counted separately for each detector and partition.
`total_over_field_count`::
(long) The number of `over` field values that were analyzed by the models.
(long) The number of `over` field values that were analyzed by the models.
+
NOTE: The `over` field values are counted separately for each detector and partition.
@ -196,12 +190,10 @@ This information is available only for open jobs.
(string) The node name.
`ephemeral_id`::
(string) The ephemeral id of the node.
`transport_address`::
(string) The host and port where transport HTTP connections are accepted.
`attributes`::
(object) {ml} attributes.
`max_running_jobs`::: The maximum number of concurrently open jobs that are
allowed per node.
(object) The {ml} attributes of the node. For example, `{"max_running_jobs": "10"}`.

View File

@ -15,9 +15,17 @@ The job must have been opened prior to sending data.
File sizes are limited to 100 MB, so if your file is larger,
then split it into multiple files and upload each one separately in sequential time order.
When running in real-time, it is generally recommended to arrange to perform
When running in real-time, it is generally recommended to perform
many small uploads, rather than queueing data to upload larger files.
When uploading data, check the <<ml-datacounts,job data counts>> for progress.
The following records will not be processed:
* Records not in chronological order and outside the latency window
* Records with an invalid timestamp
//TBD link to Working with Out of Order timeseries concept doc
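As a minimal sketch, sending two newline-delimited JSON records to a job might look like this
(the `it_ops_kpi` job name and the field names are illustrative; the `_xpack/ml` path assumes
the 5.x APIs):

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/it_ops_kpi/_data
{"@timestamp": "2017-04-27T10:00:00Z", "events_per_min": 42}
{"@timestamp": "2017-04-27T10:01:00Z", "events_per_min": 37}
--------------------------------------------------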
IMPORTANT: Data can only be accepted from a single connection.
Use a single connection synchronously to send data, close, flush, or delete a single job.
It is not currently possible to post data to multiple jobs using wildcards

View File

@ -14,7 +14,6 @@ When choosing a new value, consider the following:
* Persistence enables snapshots to be reverted.
* The time taken to persist a job is proportional to the size of the model in memory.
//* The smallest allowed value is 3600 (1 hour).
////
A model snapshot resource has the following properties:
@ -34,7 +33,8 @@ A model snapshot resource has the following properties:
(object) Summary information describing the model. See <<ml-snapshot-stats,Model Size Statistics>>.
`retain`::
(boolean) If true, this snapshot will not be deleted during automatic cleanup of snapshots older than `model_snapshot_retention_days`.
(boolean) If true, this snapshot will not be deleted during automatic cleanup of snapshots
older than `model_snapshot_retention_days`.
However, this snapshot will be deleted when the job is deleted.
The default value is false.
@ -89,4 +89,4 @@ The `model_size_stats` object has the following properties:
`total_partition_field_count`::
(long) The number of _partition_ field values analyzed.
////

View File

@ -1,6 +1,6 @@
[[ml-settings]]
== Machine Learning Settings
You do not need to configure any settings to use {ml}.
You do not need to configure any settings to use {ml}. It is enabled by default.
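For example, a dedicated {ml} node might set the following in `elasticsearch.yml`
(a sketch using the settings named elsewhere in these docs):

[source,yaml]
--------------------------------------------------
xpack.ml.enabled: true
node.ml: true
--------------------------------------------------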
[float]
[[general-ml-settings]]