[DOCS] Review of API docs part 1 (elastic/x-pack-elasticsearch#1118)

* [DOCS] Review of close job and job stats

* [DOCS] Add force close

* [DOCS] Remove invalid params from get records

* [DOCS] Remove invalid params from get buckets

* [DOCS] Job resource corrections

Original commit: elastic/x-pack-elasticsearch@bc68d05097
Sophie Chang 2017-04-19 18:52:30 +01:00 committed by lcawley
parent 7ee48846ec
commit e2cc00ab8e
7 changed files with 152 additions and 106 deletions

View File

@ -91,7 +91,7 @@ include::ml/get-record.asciidoc[]
* <<ml-datafeed-resource,Data feeds>>
* <<ml-datafeed-counts,Data feed counts>>
* <<ml-job-resource,Jobs>>
* <<ml-jobcounts,Job counts>>
* <<ml-jobstats,Job Stats>>
* <<ml-snapshot-resource,Model snapshots>>
* <<ml-results-resource,Results>>

View File

@ -5,6 +5,9 @@
The close job API enables you to close a job.
A job can be opened and closed multiple times throughout its lifecycle.
A closed job cannot receive data or perform analysis
operations, but you can still explore and navigate results.
===== Request
`POST _xpack/ml/anomaly_detectors/<job_id>/_close`
@ -18,14 +21,15 @@ flushing buffers, calculating final results and persisting the model snapshots.
Depending upon the size of the job, it could take several minutes to close and
the equivalent time to re-open.
After it is closed, the job has almost no overhead on the cluster except for
maintaining its meta data. A closed job cannot receive data or perform analysis
operations, but you can still explore and navigate results.
After it is closed, the job has minimal overhead on the cluster except for
maintaining its metadata.
Therefore it is a best practice to close jobs that are no longer required to process data.
When a datafeed that has a specified end date stops, it automatically closes the job.
You must have `manage_ml`, or `manage` cluster privileges to use this API.
For more information, see <<privileges-list-cluster>>.
//NOTE: TBD
//OUTDATED?: If using the {prelert} UI, the job will be automatically closed when stopping a datafeed job.
===== Path Parameters
@ -35,9 +39,14 @@ For more information, see <<privileges-list-cluster>>.
===== Query Parameters
`close_timeout`::
(time) Controls the time to wait until a job has closed.
(time units) Controls the time to wait until a job has closed.
The default value is 30 minutes.
`force`::
(boolean) Use to close a failed job, or to forcefully close a job which has not
responded to its initial close request.
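For example, the following request force closes a job (the job ID `it_ops_kpi` is illustrative):

[source,js]
----
POST _xpack/ml/anomaly_detectors/it_ops_kpi/_close?force=true
----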
////
===== Responses

View File

@ -29,7 +29,7 @@ The API returns the following information:
`jobs`::
(array) An array of job count objects.
For more information, see <<ml-jobcounts,Job Counts>>.
For more information, see <<ml-jobstats,Job Stats>>.
////
===== Responses

View File

@ -27,10 +27,6 @@ privileges to use this API. For more information, see <<privileges-list-cluster>
`end`::
(string) Returns records with timestamps earlier than this time.
`expand`::
(boolean) TBD
//This field did not work on older build.
`from`::
(integer) Skips the specified number of records.

View File

@ -1,12 +1,12 @@
//lcawley Verified example output 2017-04-11
[[ml-jobcounts]]
==== Job Counts
[[ml-jobstats]]
==== Job Stats
The get job statistics API provides information about the operational
progress of a job.
NOTE: Job count values are cumulative for the lifetime of a job. If a model snapshot is reverted
or old results are deleted, the job counts are not reset.
`assignment_explanation`::
(string) For open jobs only, contains messages relating to the selection of an executing node.
`data_counts`::
(object) An object that describes the number of records processed and any related error counts.
@ -19,31 +19,36 @@ or old results are deleted, the job counts are not reset.
(object) An object that provides information about the size and contents of the model.
See <<ml-modelsizestats,model size stats objects>>
`node`::
(object) For open jobs only, contains information about the executing node.
See <<ml-stats-node,node object>>.
`open_time`::
(string) For open jobs only, the elapsed time for which the job has been open.
For example, `28746386s`.
`state`::
(string) The status of the job, which can be one of the following values:
`closed`::: The job finished successfully with its model state persisted.
The job is still available to accept further data. +
+
--
NOTE: If you send data in a periodic cycle and close the job at the end of
each transaction, the job is marked as closed in the intervals between
when data is sent. For example, if data is sent every minute and it takes
1 second to process, the job has a closed state for 59 seconds.
--
`closing`::: TBD. The job is in the process of closing?
`open`::: The job is available to receive and process data.
`closed`::: The job finished successfully with its model state persisted.
The job must be opened before it can accept further data.
`closing`::: The job close action is in progress and has not yet completed.
A closing job cannot accept further data.
`failed`::: The job did not finish successfully due to an error.
This situation can occur due to invalid input data. In this case,
sending corrected data to a failed job re-opens the job and
resets it to an open state.
`open`::: The job is actively receiving and processing data.
This situation can occur due to invalid input data.
If the job has irrevocably failed, it must be force closed and then deleted.
If the datafeed can be corrected, the job can be closed and then re-opened.
[float]
[[ml-datacounts]]
===== Data Counts Objects
The `data_counts` object describes the number of records processed
and any related error counts. It has the following properties:
and any related error counts.
The `data_counts` values are cumulative for the lifetime of a job. If a model snapshot is reverted
or old results are deleted, the job counts are not reset.
`bucket_count`::
(long) The number of bucket results produced by the job.
@ -53,7 +58,9 @@ and any related error counts. It has the following properties:
The datetime string is in ISO 8601 format.
`empty_bucket_count`::
() TBD
(long) The number of buckets which did not contain any data. If your data contains many
empty buckets, consider increasing your `bucket_span` or using functions that are tolerant
to gaps in data such as `mean`, `non_null_sum` or `non_zero_count`.
`input_bytes`::
(long) The number of raw bytes read by the job.
@ -72,16 +79,16 @@ and any related error counts. It has the following properties:
(string) A numerical character string that uniquely identifies the job.
`last_data_time`::
() TBD
(datetime) The timestamp at which data was last analyzed, according to server time.
`latest_empty_bucket_timestamp`::
(date) The timestamp of the last bucket that did not contain any data.
`latest_record_timestamp`::
(string) The timestamp of the last chronologically ordered record.
If the records are not in strict chronological order, this value might not be
the same as the timestamp of the last record.
The datetime string is in ISO 8601 format.
(date) The timestamp of the last processed record.
`latest_sparse_bucket_timestamp`::
() TBD
(date) The timestamp of the last bucket that was considered sparse.
`missing_field_count`::
(long) The number of records that are missing a field that the job is configured to analyze.
@ -97,6 +104,7 @@ necessarily a cause for concern.
`out_of_order_timestamp_count`::
(long) The number of records that are out of time sequence and outside of the latency window.
This is only applicable when using the `_data` endpoint.
These records are discarded, since jobs require time series data to be in ascending chronological order.
`processed_field_count`::
@ -108,13 +116,16 @@ necessarily a cause for concern.
(long) The number of records that have been processed by the job.
This value includes records with missing fields, since they are nonetheless analyzed.
+
The following records are not processed:
When using datafeeds, the `processed_record_count` will differ from the `input_record_count`
if you are using aggregations in your search query.
+
When posting to the `/_data` endpoint, the following records are not processed:
* Records not in chronological order and outside the latency window
* Records with invalid timestamp
* Records filtered by an exclude transform
`sparse_bucket_count`::
() TBD
(long) The number of buckets which contained few data points compared to the expected number
of data points. If your data contains many sparse buckets, consider using a longer `bucket_span`.
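For example, a `data_counts` object might look like the following sketch (the job ID and all values are illustrative, not verified output):

[source,js]
----
{
  "job_id": "it_ops_kpi",
  "processed_record_count": 86400,
  "processed_field_count": 172800,
  "input_record_count": 86400,
  "input_bytes": 2846259,
  "missing_field_count": 0,
  "out_of_order_timestamp_count": 0,
  "empty_bucket_count": 16,
  "sparse_bucket_count": 2,
  "bucket_count": 1440,
  "latest_record_timestamp": 1491948163000,
  "last_data_time": 1491948163685
}
----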
[float]
[[ml-modelsizestats]]
@ -123,13 +134,14 @@ necessarily a cause for concern.
The `model_size_stats` object has the following properties:
`bucket_allocation_failures_count`::
() TBD
(long) The number of buckets for which new entities in incoming data were not processed due to
insufficient model memory, as signified by a `hard_limit` `memory_status`.
`job_id`::
(string) A numerical character string that uniquely identifies the job.
`log_time`::
() TBD
(date) The timestamp of the `model_size_stats` according to server time.
`memory_status`::
(string) The status of the mathematical models. This property can have one of the following values:
@ -142,24 +154,46 @@ The `model_size_stats` object has the following properties:
last time the model was persisted. If the job is closed, this value indicates the latest size.
`result_type`::
TBD
(string) For internal use. The type of result.
`total_by_field_count`::
(long) The number of `by` field values that were analyzed by the models. +
(long) The number of `by` field values that were analyzed by the models.
+
--
NOTE: The `by` field values are counted separately for each detector and partition.
--
`total_over_field_count`::
(long) The number of `over` field values that were analyzed by the models. +
(long) The number of `over` field values that were analyzed by the models.
+
--
NOTE: The `over` field values are counted separately for each detector and partition.
--
`total_partition_field_count`::
(long) The number of `partition` field values that were analyzed by the models.
`timestamp`::
TBD
(date) The timestamp of the `model_size_stats` according to the timestamp of the data.
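As a rough sketch, a `model_size_stats` object might look like the following (values are illustrative; the `model_bytes` property for the model memory size is assumed here and not described above):

[source,js]
----
{
  "job_id": "it_ops_kpi",
  "result_type": "model_size_stats",
  "model_bytes": 186061,
  "total_by_field_count": 3,
  "total_over_field_count": 0,
  "total_partition_field_count": 2,
  "bucket_allocation_failures_count": 0,
  "memory_status": "ok",
  "log_time": 1491948163000,
  "timestamp": 1491948000000
}
----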
[float]
[[ml-stats-node]]
===== Node Objects
The `node` object contains properties of the executing node. It is only available for open jobs.
`id`::
(string) The unique identifier of the executing node.
`name`::
(string) The node's name.
`ephemeral_id`::
(string) The ephemeral ID of the node.
`transport_address`::
(string) Host and port where transport HTTP connections are accepted.
`attributes`::
(object) {ml} attributes.
`max_running_jobs`::: The maximum number of concurrently open jobs allowed per node.
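For example, a `node` object might look like the following sketch (the identifiers and address are illustrative):

[source,js]
----
{
  "id": "2spCyo1pRi2AJmczMGi3NA",
  "name": "node-0",
  "ephemeral_id": "PvXxtabcTyWdCDW9BGknbw",
  "transport_address": "127.0.0.1:9300",
  "attributes": {
    "max_running_jobs": "10"
  }
}
----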

View File

@ -5,10 +5,11 @@
A job resource has the following properties:
`analysis_config`::
(object) The analysis configuration, which specifies how to analyze the data. See <<ml-analysisconfig, analysis configuration objects>>.
(object) The analysis configuration, which specifies how to analyze the data.
See <<ml-analysisconfig, analysis configuration objects>>.
`analysis_limits`::
(object) Defines limits on the number of field values and time buckets to be analyzed.
(object) Defines approximate limits on the memory resource requirements for the job.
See <<ml-apilimits,analysis limits>>.
`create_time`::
@ -21,17 +22,17 @@ A job resource has the following properties:
(string) An optional description of the job.
`finished_time`::
(string) If the job closed of failed, this is the time the job finished, in ISO 8601 format.
Otherwise, it is `null`. For example, `1491007365347`.
(string) If the job closed or failed, this is the time the job finished, otherwise it is `null`.
`job_id`::
(string) A numerical character string that uniquely identifies the job.
(string) The unique identifier for the job.
`job_type`::
(string) TBD. For example: "anomaly_detector".
(string) Reserved for future use, currently set to `anomaly_detector`.
`model_plot_config`:: TBD
`enabled`:: TBD. For example, `true`.
`model_plot_config`::
(object) Configuration properties for storing additional model information.
See <<ml-apimodelplotconfig, model plot configuration>>.
`model_snapshot_id`::
(string) A numerical character string that uniquely identifies the model
@ -42,7 +43,8 @@ A job resource has the following properties:
Older snapshots are deleted. The default value is 1 day.
`results_index_name`::
() TBD. For example, `shared`.
(string) The name of the index in which to store the results generated by {ml}.
The default value is `shared`, which corresponds to the index name `.ml-anomalies-shared`.
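As an illustrative sketch (not verified output), a job resource using these properties might look like the following; the job ID, timestamps, and the `detectors` array name inside `analysis_config` are example values and assumptions:

[source,js]
----
{
  "job_id": "it_ops_kpi",
  "job_type": "anomaly_detector",
  "description": "KPI event rate",
  "create_time": 1491007356077,
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "Low event rate",
        "function": "low_sum",
        "field_name": "events_per_min"
      }
    ]
  },
  "data_description": {
    "time_field": "time",
    "time_format": "epoch_ms"
  },
  "model_snapshot_id": "1491007364",
  "results_index_name": "shared"
}
----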
[[ml-analysisconfig]]
===== Analysis Configuration Objects
@ -50,8 +52,8 @@ A job resource has the following properties:
An analysis configuration object has the following properties:
`bucket_span` (required)::
(unsigned integer) The size of the interval that the analysis is aggregated into, measured in seconds. The default value is 5 minutes.
//TBD: Is this now measured in minutes?
(time units) The size of the interval that the analysis is aggregated into, typically between `5m` and `1h`.
The default value is `5m`.
`categorization_field_name`::
(string) If not null, the values of the specified field will be categorized.
@ -84,9 +86,9 @@ and an error is returned.
the use of influencers is recommended as it aggregates results for each influencer entity.
`latency`::
(unsigned integer) The size of the window, in seconds, in which to expect data that is out of time order. The default value is 0 milliseconds (no latency). +
+
--
(unsigned integer) The size of the window, in seconds, in which to expect data that is out of time order.
The default value is 0 (no latency).
NOTE: Latency is only applicable when you send data by using the <<ml-post-data, Post Data to Jobs>> API.
--
@ -103,10 +105,10 @@ NOTE: Latency is only applicable when you send data by using the <<ml-post-data,
--
NOTE: To use the `multivariate_by_fields` property, you must also specify `by_field_name` in your detector.
--
`overlapping_buckets`::
(boolean) If set to `true`, an additional analysis occurs that runs out of phase by half a bucket length.
This requires more system resources and enhances detection of anomalies that span bucket boundaries.
// LEAVE UNDOCUMENTED
// `overlapping_buckets`::
// (boolean) If set to `true`, an additional analysis occurs that runs out of phase by half a bucket length.
// This requires more system resources and enhances detection of anomalies that span bucket boundaries.
`summary_count_field_name`::
(string) If not null, the data fed to the job is expected to be pre-summarized.
@ -115,10 +117,11 @@ NOTE: To use the `multivariate_by_fields` property, you must also specify `by_fi
+
--
NOTE: The `summary_count_field_name` property cannot be used with the `metric` function.
--
`use_per_partition_normalization`::
() TBD
// LEAVE UNDOCUMENTED
// `use_per_partition_normalization`::
// () TBD
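For example, an analysis configuration object might look like the following sketch (the field names `responsetime` and `airline` are illustrative, and the `detectors` and `influencers` property names are assumed here):

[source,js]
----
{
  "bucket_span": "10m",
  "latency": 0,
  "influencers": [ "airline" ],
  "detectors": [
    {
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }
  ]
}
----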
[[ml-detectorconfig]]
===== Detector Configuration Objects
@ -134,10 +137,11 @@ Each detector has the following properties:
It is used for finding unusual values in the context of the split.
`detector_description`::
(string) A description of the detector. For example, `low_sum(events_per_min)`.
(string) A description of the detector. For example, `Low event rate`.
`detector_rules`::
(array) TBD
// LEAVE UNDOCUMENTED
// `detector_rules`::
// (array) TBD
`exclude_frequent`::
(string) Contains one of the following values: `all`, `none`, `by`, or `over`.
@ -152,19 +156,12 @@ Each detector has the following properties:
+
--
NOTE: The `field_name` cannot contain double quotes or backslashes.
--
`function` (required)::
(string) The analysis function that is used.
For example, `count`, `rare`, `mean`, `min`, `max`, and `sum`.
The default function is `metric`, which looks for anomalies in all of `min`, `max`,
and `mean`. +
+
--
NOTE: You cannot use the `metric` function with pre-summarized input. If `summary_count_field_name`
is not null, you must specify a function other than `metric`.
--
`over_field_name`::
(string) The field used to split the data.
In particular, this property is used for analyzing the splits with respect to the history of all splits.
@ -180,33 +177,21 @@ NOTE: You cannot use the `metric` function with pre-summarized input. If `summar
+
--
IMPORTANT: Field names are case sensitive, for example a field named 'Bytes' is different to one named 'bytes'.
--
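For example, a detector that uses these properties might look like the following sketch (the field names are illustrative):

[source,js]
----
{
  "detector_description": "Mean response time by airline",
  "function": "mean",
  "field_name": "responsetime",
  "by_field_name": "airline",
  "exclude_frequent": "none"
}
----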
[[ml-datadescription]]
===== Data Description Objects
The data description settings define the format of the input data.
When data is read from Elasticsearch, the datafeed must be configured.
This defines which index data will be taken from, and over what time period.
The data description defines the format of the input data when you post time-ordered data to the `_data` endpoint.
Note that when a datafeed is configured, these properties are set automatically.
When data is received via the <<ml-post-data, Post Data to Jobs>> API,
you must specify the data format (for example, JSON or CSV). In this scenario,
the data posted is not stored in Elasticsearch. Only the results for anomaly detection are retained.
When you create a job, by default it accepts data in tab-separated-values format and expects
an Epoch time value in a field named `time`. The `time` field must be measured in seconds from the Epoch.
If, however, your data is not in this format, you can provide a data description object that specifies the
format of your data.
A data description object has the following properties:
`fieldDelimiter`::
() TBD
`format`::
() TBD
(string) Only `JSON` format is supported at this time.
`time_field`::
(string) The name of the field that contains the timestamp.
@ -215,23 +200,22 @@ A data description object has the following properties:
`time_format`::
(string) The time format, which can be `epoch`, `epoch_ms`, or a custom pattern.
The default value is `epoch`, which refers to UNIX or Epoch time (the number of seconds
since 1 Jan 1970) and corresponds to the time_t type in C and C++.
since 1 Jan 1970).
The value `epoch_ms` indicates that time is measured in milliseconds since the epoch.
The `epoch` and `epoch_ms` time formats accept either integer or real values. +
+
--
NOTE: Custom patterns must conform to the Java `DateTimeFormatter` class. When you use date-time formatting patterns, it is recommended that you provide the full date, time and time zone. For example: `yyyy-MM-dd'T'HH:mm:ssX`. If the pattern that you specify is not sufficient to produce a complete timestamp, job creation fails.
--
`quotecharacter`::
() TBD
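For example, a data description object might look like the following sketch (the field name `timestamp` is illustrative):

[source,js]
----
{
  "format": "JSON",
  "time_field": "timestamp",
  "time_format": "epoch_ms"
}
----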
[[ml-apilimits]]
===== Analysis Limits
Limits can be applied for the size of the mathematical models that are held in memory.
These limits can be set per job and do not control the memory used by other processes.
If necessary, the limits can also be updated after the job is created.
Limits can be applied for the resources required to hold the mathematical models in memory.
These limits are approximate and can be set per job.
They do not control the memory used by other processes, for example the Elasticsearch Java processes.
If necessary, the limits can be increased after the job is created.
The `analysis_limits` object has the following properties:
@ -241,10 +225,33 @@ The `analysis_limits` object has the following properties:
more examples are available, however it requires that you have more storage available.
If you set this value to `0`, no examples are stored.
////
NOTE: The `categorization_examples_limit` only applies to analysis that uses categorization.
////
`model_memory_limit`::
(long) The maximum amount of memory, in MiB, that the mathematical models can use.
Once this limit is approached, data pruning becomes more aggressive.
Upon exceeding this limit, new entities are not modeled. The default value is 4096.
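For example, an `analysis_limits` object might look like the following sketch (the `categorization_examples_limit` value shown is illustrative):

[source,js]
----
{
  "model_memory_limit": 4096,
  "categorization_examples_limit": 4
}
----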
[[ml-apimodelplotconfig]]
===== Model Plot Config
This advanced configuration option stores model information along with the results, allowing a more detailed view into anomaly detection.
Enabling this can add considerable overhead to the performance of the system and is not feasible for jobs with many entities.
Model plot provides a simplified and indicative view of the model and its bounds.
It does not display complex features such as multivariate correlations or multimodal data.
As such, anomalies that cannot be seen in the model plot may occasionally be reported.
Model plot config can be configured when the job is created or updated later.
It must be disabled if performance issues are experienced.
The `model_plot_config` object has the following properties:
`enabled`::
(boolean) If `true`, enables calculation and storage of the model bounds for each entity that is being analyzed.
By default, this is not enabled.
`terms`::
(string) Limits data collection to this comma-separated list of _partition_ or _by_ field names.
If terms are not specified or the value is an empty string, no filtering is applied.
For example, `"CPU,NetworkIn,DiskWrites"`.
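For example, a `model_plot_config` object that enables model plot for three metrics might look like this sketch:

[source,js]
----
{
  "enabled": true,
  "terms": "CPU,NetworkIn,DiskWrites"
}
----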

View File

@ -88,4 +88,4 @@ For example:
}
----
For more information about these properties, see <<ml-jobcounts,Job Counts>>.
For more information about these properties, see <<ml-jobstats,Job Stats>>.