[DOCS] Fixes build and typo issues in ML API update elastic/x-pack-elasticsearch#1118

Original commit: elastic/x-pack-elasticsearch@3ef20792ac
This commit is contained in:
Lisa Cawley 2017-04-19 13:31:07 -07:00 committed by GitHub
parent 7656e4a67b
commit ded8edcc3d
2 changed files with 158 additions and 108 deletions

View File

@ -1,31 +1,32 @@
//lcawley Verified example output 2017-04-11
[[ml-jobstats]]
==== Job Stats
==== Job Statistics
The get job statistics API provides information about the operational
progress of a job.
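For illustration, a minimal request sketch follows; the endpoint path and the `it-ops-kpi` job name are shown for illustration only and are not defined on this page. The response includes the properties described below for each matching job.

[source,js]
----
GET _xpack/ml/anomaly_detectors/it-ops-kpi/_stats
----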
`assignment_explanation`::
(string) For open jobs only, contains messages relating to the selection of an executing node.
(string) For open jobs only, contains messages relating to the selection
of a node to run the job.
`data_counts`::
(object) An object that describes the number of records processed and any related error counts.
See <<ml-datacounts,data counts objects>>.
(object) An object that describes the number of records processed and
any related error counts. See <<ml-datacounts,data counts objects>>.
`job_id`::
(string) A numerical character string that uniquely identifies the job.
(string) A unique identifier for the job.
`model_size_stats`::
(object) An object that provides information about the size and contents of the model.
See <<ml-modelsizestats,model size stats objects>>
`node`::
(object) For open jobs only, contains information about the executing node.
See <<ml-stats-node,node object>>.
(object) For open jobs only, contains information about the node where the
job runs. See <<ml-stats-node,node object>>.
`open_time`::
(string) For open jobs only, the elapsed time for which the job has been open.
E.g. `28746386s`.
For example, `28746386s`.
`state`::
(string) The status of the job, which can be one of the following values:
@ -36,16 +37,16 @@ progress of a job.
`closing`::: The job close action is in progress and has not yet completed.
A closing job cannot accept further data.
`failed`::: The job did not finish successfully due to an error.
This situation can occur due to invalid input data.
If the job had irrevecobly failed, it must be force closed and then deleted.
If the datafeed can be corrected, the job can be closed and then re-opened.
This situation can occur due to invalid input data.
If the job had irrevocably failed, it must be force closed and then deleted.
If the data feed can be corrected, the job can be closed and then re-opened.
[float]
[[ml-datacounts]]
===== Data Counts Objects
The `data_counts` object describes the number of records processed
and any related error counts.
and any related error counts.
The `data_count` values are cumulative for the lifetime of a job. If a model snapshot is reverted
or old results are deleted, the job counts are not reset.
@ -58,8 +59,8 @@ or old results are deleted, the job counts are not reset.
The datetime string is in ISO 8601 format.
`empty_bucket_count`::
(long) The number of buckets which did not contain any data. If your data contains many
empty buckets, consider increasing your `bucket_span` or using functions that are tolerant
(long) The number of buckets which did not contain any data. If your data contains many
empty buckets, consider increasing your `bucket_span` or using functions that are tolerant
to gaps in data such as `mean`, `non_null_sum` or `non_zero_count`.
`input_bytes`::
@ -76,10 +77,10 @@ or old results are deleted, the job counts are not reset.
(long) The number of records with either a missing date field or a date that could not be parsed.
`job_id`::
(string) A numerical character string that uniquely identifies the job.
(string) A unique identifier for the job.
`last_data_time`::
(datetime) The timestamp at which data was last analyzed, according to server time.
(datetime) The timestamp at which data was last analyzed, according to server time.
`latest_empty_bucket_timestamp`::
(date) The timestamp of the last bucket that did not contain any data.
@ -91,9 +92,10 @@ or old results are deleted, the job counts are not reset.
(date) The timestamp of the last bucket that was considered sparse.
`missing_field_count`::
(long) The number of records that are missing a field that the job is configured to analyze.
Records with missing fields are still processed because it is possible that not all fields are missing.
The value of `processed_record_count` includes this count. +
(long) The number of records that are missing a field that the job is
configured to analyze. Records with missing fields are still processed because
it is possible that not all fields are missing. The value of
`processed_record_count` includes this count. +
+
--
NOTE: If you are using data feeds or posting data to the job in JSON format, a
@ -103,29 +105,35 @@ necessarily a cause for concern.
--
`out_of_order_timestamp_count`::
(long) The number of records that are out of time sequence and outside of the latency window.
This is only applicable when using the `_data` endpoint.
These records are discarded, since jobs require time series data to be in ascending chronological order.
(long) The number of records that are out of time sequence and
outside of the latency window. This information is applicable only when
you provide data to the job by using the <<ml-post-data,post data API>>.
These out of order records are discarded, since jobs require time series data
to be in ascending chronological order.
`processed_field_count`::
(long) The total number of fields in all the records that have been processed by the job.
Only fields that are specified in the detector configuration object contribute to this count.
The time stamp is not included in this count.
(long) The total number of fields in all the records that have been processed
by the job. Only fields that are specified in the detector configuration
object contribute to this count. The time stamp is not included in this count.
`processed_record_count`::
(long) The number of records that have been processed by the job.
This value includes records with missing fields, since they are nonetheless analyzed.
+
When using datafeeds, the `processed_record_count` will differ from the `input_record_count`
if you are using aggregations in your search query.
+
When posting to the `/_data` endpoint, the following records are not processed:
* Records not in chronological order and outside the latency window
* Records with invalid timestamp
This value includes records with missing fields, since they are nonetheless
analyzed. +
If you use data feeds and have aggregations in your search query,
the `processed_record_count` differs from the `input_record_count`. +
If you use the <<ml-post-data,post data API>> to provide data to the job,
the following records are not processed: +
+
--
* Records not in chronological order and outside the latency window
* Records with invalid timestamp
--
`sparse_bucket_count`::
(long) The number of buckets which contained few data points compared to the expected number
of data points. If your data contains many sparse buckets, consider using a longer `bucket_span`.
(long) The number of buckets that contained few data points compared to the
expected number of data points. If your data contains many sparse buckets,
consider using a longer `bucket_span`.
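As a sketch of how these counts fit together, a partial `data_counts` object might look like the following. All values are invented and the job name is hypothetical; only field names that are described above are shown.

[source,js]
----
"data_counts": {
  "job_id": "it-ops-kpi",
  "processed_record_count": 86400,
  "processed_field_count": 172800,
  "input_record_count": 86400,
  "input_bytes": 6744714,
  "missing_field_count": 0,
  "out_of_order_timestamp_count": 0,
  "empty_bucket_count": 16,
  "sparse_bucket_count": 3
}
----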
[float]
[[ml-modelsizestats]]
@ -134,8 +142,9 @@ necessarily a cause for concern.
The `model_size_stats` object has the following properties:
`bucket_allocation_failures_count`::
(long) The number of buckets for which new entites in incoming data was not processed due to a
insufficient model memory as signified by a `hard_limit` `memory_status`.
(long) The number of buckets for which new entities in incoming data were not
processed due to insufficient model memory. This situation is also signified
by a `memory_status` property value of `hard_limit`.
`job_id`::
(string) A unique identifier for the job.
@ -144,14 +153,18 @@ The `model_size_stats` object has the following properties:
(date) The timestamp of the `model_size_stats` according to server time.
`memory_status`::
(string) The status of the mathematical models. This property can have one of the following values:
(string) The status of the mathematical models.
This property can have one of the following values:
`ok`::: The models stayed below the configured value.
`soft_limit`::: The models used more than 60% of the configured memory limit and older unused models will be pruned to free up space.
`hard_limit`::: The models used more space than the configured memory limit. As a result, not all incoming data was processed.
`soft_limit`::: The models used more than 60% of the configured memory limit
and older unused models will be pruned to free up space.
`hard_limit`::: The models used more space than the configured memory limit.
As a result, not all incoming data was processed.
`model_bytes`::
(long) The number of bytes of memory used by the models. This is the maximum value since the
last time the model was persisted. If the job is closed, this value indicates the latest size.
(long) The number of bytes of memory used by the models. This is the maximum
value since the last time the model was persisted. If the job is closed,
this value indicates the latest size.
`result_type`::
(string) For internal use. The type of result.
@ -161,6 +174,7 @@ The `model_size_stats` object has the following properties:
+
--
NOTE: The `by` field values are counted separately for each detector and partition.
--
`total_over_field_count`::
@ -168,6 +182,7 @@ NOTE: The `by` field values are counted separately for each detector and partiti
+
--
NOTE: The `over` field values are counted separately for each detector and partition.
--
`total_partition_field_count`::
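For orientation, a sketch of a `model_size_stats` object, restricted to fields described in this section (all values are invented and the job name is hypothetical):

[source,js]
----
"model_size_stats": {
  "job_id": "it-ops-kpi",
  "result_type": "model_size_stats",
  "model_bytes": 100393,
  "total_over_field_count": 0,
  "total_partition_field_count": 2,
  "bucket_allocation_failures_count": 0,
  "memory_status": "ok"
}
----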
@ -180,20 +195,22 @@ NOTE: The `over` field values are counted separately for each detector and parti
[[ml-stats-node]]
===== Node Objects
The `node` objects contains properties of the executing node and is only available for open jobs.
The `node` object contains properties for the node that runs the job.
This information is available only for open jobs.
`id`::
(string) The unique identifier of the executing node.
(string) The unique identifier of the node.
`name`::
(string) The node's name.
(string) The node name.
`ephemeral_id`::
`transport_address`::
(string) Host and port where transport HTTP connections are accepted.
(string) The host and port where transport HTTP connections are accepted.
`attributes`::
(object) {ml} attributes.
`max_running_jobs`::: The maximum number of concurrently open jobs allowed per node.
`max_running_jobs`::: The maximum number of concurrently open jobs that are
allowed per node.
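A sketch of the `node` object as it might appear for an open job; the identifiers and the attribute value are invented, and only the property names above are taken from this page:

[source,js]
----
"node": {
  "id": "pMGYC_FRQDSp9D_OgZb2lA",
  "name": "node-1",
  "ephemeral_id": "W2wLxEjLT4ygEqu2XRJJbQ",
  "transport_address": "127.0.0.1:9300",
  "attributes": {
    "max_running_jobs": "10"
  }
}
----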

View File

@ -5,7 +5,7 @@
A job resource has the following properties:
`analysis_config`::
(object) The analysis configuration, which specifies how to analyze the data.
(object) The analysis configuration, which specifies how to analyze the data.
See <<ml-analysisconfig, analysis configuration objects>>.
`analysis_limits`::
@ -13,16 +13,19 @@ A job resource has the following properties:
See <<ml-apilimits,analysis limits>>.
`create_time`::
(string) The time the job was created, in ISO 8601 format. For example, `1491007356077`.
(string) The time the job was created. For example, `1491007356077`.
`data_description`::
(object) Describes the data format and how APIs parse timestamp fields. See <<ml-datadescription,data description objects>>.
(object) Describes the data format and how APIs parse timestamp fields.
See <<ml-datadescription,data description objects>>.
`description`::
(string) An optional description of the job.
`finished_time`::
(string) If the job closed or failed, this is the time the job finished, otherwise it is `null`.
(string) If the job closed or failed, this is the time the job finished,
otherwise it is `null`.
`job_id`::
(string) The unique identifier for the job.
@ -31,7 +34,7 @@ A job resource has the following properties:
(string) Reserved for future use, currently set to `anomaly_detector`.
`model_plot_config`::
(object) Configuration properties for storing additional model information.
(object) Configuration properties for storing additional model information.
See <<ml-apimodelplotconfig, model plot configuration>>.
`model_snapshot_id`::
@ -43,8 +46,9 @@ A job resource has the following properties:
Older snapshots are deleted. The default value is 1 day.
`results_index_name`::
(string) The name of the index in which to store the results generated by {ml} results.
The default value is `shared` which corresponds to the index name `.ml-anomalies-shared`
(string) The name of the index in which to store the {ml} results.
The default value is `shared`,
which corresponds to the index name `.ml-anomalies-shared`.
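Putting these properties together, a pared-down job resource might look like the following sketch. The job identifier, the description, and the nested values are hypothetical; only the property names come from this page.

[source,js]
----
{
  "job_id": "it-ops-kpi",
  "job_type": "anomaly_detector",
  "description": "An illustrative job",
  "create_time": 1491007356077,
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      { "function": "count" }
    ]
  },
  "data_description": {
    "format": "JSON",
    "time_field": "timestamp"
  },
  "results_index_name": "shared"
}
----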
[[ml-analysisconfig]]
===== Analysis Configuration Objects
@ -52,8 +56,8 @@ A job resource has the following properties:
An analysis configuration object has the following properties:
`bucket_span` (required)::
(time units) The size of the interval that the analysis is aggregated into, typically between `5m` and `1h`.
The default value is `5m`.
(time units) The size of the interval that the analysis is aggregated into,
typically between `5m` and `1h`. The default value is `5m`.
`categorization_field_name`::
(string) If not null, the values of the specified field will be categorized.
@ -61,12 +65,13 @@ An analysis configuration object has the following properties:
`over_field_name`, or `partition_field_name` to the keyword `prelertcategory`.
`categorization_filters`::
(array of strings) If `categorization_field_name` is specified, you can also define optional filters.
This property expects an array of regular expressions.
The expressions are used to filter out matching sequences off the categorization field values.
This functionality is useful to fine tune categorization by excluding sequences
that should not be taken into consideration for defining categories.
For example, you can exclude SQL statements that appear in your log files.
(array of strings) If `categorization_field_name` is specified,
you can also define optional filters. This property expects an array of
regular expressions. The expressions are used to filter out matching sequences
from the categorization field values. This functionality is useful to fine-tune
categorization by excluding sequences that should not be taken into
consideration for defining categories. For example, you can exclude SQL
statements that appear in your log files.
`detectors` (required)::
(array) An array of detector configuration objects,
@ -80,16 +85,19 @@ and an error is returned.
--
`influencers`::
(array of strings) A comma separated list of influencer field names.
Typically these can be the by, over, or partition fields that are used in the detector configuration.
You might also want to use a field name that is not specifically named in a detector,
but is available as part of the input data. When you use multiple detectors,
the use of influencers is recommended as it aggregates results for each influencer entity.
Typically these can be the by, over, or partition fields that are used in the
detector configuration. You might also want to use a field name that is not
specifically named in a detector, but is available as part of the input data.
When you use multiple detectors, the use of influencers is recommended as it
aggregates results for each influencer entity.
`latency`::
(unsigned integer) The size of the window, in seconds, in which to expect data that is out of time order.
The default value is 0 (no latency).
NOTE: Latency is only applicable when you send data by using the <<ml-post-data, Post Data to Jobs>> API.
(unsigned integer) The size of the window, in seconds, in which to expect data
that is out of time order. The default value is 0 (no latency). +
+
--
NOTE: Latency is only applicable when you send data by using
the <<ml-post-data,post data>> API.
--
`multivariate_by_fields`::
@ -103,26 +111,32 @@ NOTE: Latency is only applicable when you send data by using the <<ml-post-data,
That is to say, you'll see an anomaly when the CPU of host A is unusual given the CPU of host B. +
+
--
NOTE: To use the `multivariate_by_fields` property, you must also specify `by_field_name` in your detector.
NOTE: To use the `multivariate_by_fields` property, you must also specify
`by_field_name` in your detector.
// LEAVE UNDOCUMENTED
// `overlapping_buckets`::
// (boolean) If set to `true`, an additional analysis occurs that runs out of phase by half a bucket length.
// This requires more system resources and enhances detection of anomalies that span bucket boundaries.
--
`summary_count_field_name`::
(string) If not null, the data fed to the job is expected to be pre-summarized.
This property value is the name of the field that contains the count of raw data points that have been summarized.
The same `summary_count_field_name` applies to all detectors in the job. +
This property value is the name of the field that contains the count of raw
data points that have been summarized. The same `summary_count_field_name`
applies to all detectors in the job. +
+
--
NOTE: The `summary_count_field_name` property cannot be used with the `metric` function.
--
// LEAVE UNDOCUMENTED
// `use_per_partition_normalization`::
// () TBD
////
LEAVE UNDOCUMENTED
`overlapping_buckets`::
(boolean) If set to `true`, an additional analysis occurs that runs out of phase by half a bucket length.
This requires more system resources and enhances detection of anomalies that span bucket boundaries.
`use_per_partition_normalization`::
() TBD
////
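As a hedged example of how these properties combine, an `analysis_config` object might look like the following sketch. The field names `responsetime` and `airline` are hypothetical; the property names are taken from this page and from the detector section below.

[source,js]
----
"analysis_config": {
  "bucket_span": "10m",
  "latency": 0,
  "detectors": [
    {
      "detector_description": "Mean response time by airline",
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }
  ],
  "influencers": [ "airline" ]
}
----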
[float]
[[ml-detectorconfig]]
===== Detector Configuration Objects
@ -139,10 +153,6 @@ Each detector has the following properties:
`detector_description`::
(string) A description of the detector. For example, `Low event rate`.
// LEAVE UNDOCUMENTED
// `detector_rules`::
// (array) TBD
`exclude_frequent`::
(string) Contains one of the following values: `all`, `none`, `by`, or `over`.
If set, frequent entities are excluded from influencing the anomaly results.
@ -156,6 +166,7 @@ Each detector has the following properties:
+
--
NOTE: The `field_name` cannot contain double quotes or backslashes.
--
`function` (required)::
@ -176,22 +187,32 @@ NOTE: The `field_name` cannot contain double quotes or backslashes.
when there is no value for the by or partition fields. The default value is `false`. +
+
--
IMPORTANT: Field names are case sensitive, for example a field named 'Bytes' is different to one named 'bytes'.
IMPORTANT: Field names are case sensitive, for example a field named 'Bytes'
is different from one named 'bytes'.
--
////
LEAVE UNDOCUMENTED
`detector_rules`::
(array) TBD
////
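A single detector entry, sketched with properties that are named in this section; the `host` field is hypothetical and the `low_count` function is chosen only to match the example description:

[source,js]
----
{
  "detector_description": "Low event rate",
  "function": "low_count",
  "partition_field_name": "host",
  "exclude_frequent": "none"
}
----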
[float]
[[ml-datadescription]]
===== Data Description Objects
The data description define the format of the input data when posting time-ordered data to the `_data` endpoint.
Please note that when configuring a datafeed, these are automatically set.
The data description defines the format of the input data when you send data to
the job by using the <<ml-post-data,post data>> API. Note that when you configure
a data feed, these properties are automatically set.
When data is received via the <<ml-post-data, Post Data to Jobs>> API,
the data posted is not stored in Elasticsearch. Only the results for anomaly detection are retained.
When data is received via the <<ml-post-data,post data>> API, it is not stored
in Elasticsearch. Only the results for anomaly detection are retained.
A data description object has the following properties:
`format`::
(string) Only `JSON` format is supported at this time.
(string) Only `JSON` format is supported at this time.
`time_field`::
(string) The name of the field that contains the timestamp.
@ -205,17 +226,23 @@ A data description object has the following properties:
The `epoch` and `epoch_ms` time formats accept either integer or real values. +
+
--
NOTE: Custom patterns must conform to the Java `DateTimeFormatter` class. When you use date-time formatting patterns, it is recommended that you provide the full date, time and time zone. For example: `yyyy-MM-dd'T'HH:mm:ssX`. If the pattern that you specify is not sufficient to produce a complete timestamp, job creation fails.
NOTE: Custom patterns must conform to the Java `DateTimeFormatter` class.
When you use date-time formatting patterns, it is recommended that you provide
the full date, time and time zone. For example: `yyyy-MM-dd'T'HH:mm:ssX`.
If the pattern that you specify is not sufficient to produce a complete timestamp,
job creation fails.
--
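A minimal `data_description` sketch follows. The `time_format` key name is an assumption based on the description above (only `format` and `time_field` are named explicitly on this page), and the field name `timestamp` is hypothetical.

[source,js]
----
"data_description": {
  "format": "JSON",
  "time_field": "timestamp",
  "time_format": "epoch_ms"
}
----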
[float]
[[ml-apilimits]]
===== Analysis Limits
Limits can be applied for the resources required to hold the mathematical models in memory.
These limits are approximate and can be set per job.
They do not control the memory used by other processes, for example the elasticsearch Java processes.
If necessary, the limits can be increased after the job is created.
These limits are approximate and can be set per job. They do not control the
memory used by other processes, for example the Elasticsearch Java processes.
If necessary, you can increase the limits after the job is created.
The `analysis_limits` object has the following properties:
@ -223,35 +250,41 @@ The `analysis_limits` object has the following properties:
(long) The maximum number of examples stored per category in memory and
in the results data store. The default value is 4. If you increase this value,
more examples are available; however, it requires that you have more storage available.
If you set this value to `0`, no examples are stored.
If you set this value to `0`, no examples are stored. +
+
--
NOTE: The `categorization_examples_limit` only applies to analysis that uses categorization.
--
`model_memory_limit`::
(long) The maximum amount of memory, in MiB, that the mathematical models can use.
Once this limit is approached, data pruning becomes more aggressive.
Upon exceeding this limit, new entities are not modeled. The default value is 4096.
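A sketch of an `analysis_limits` object using the default values described above:

[source,js]
----
"analysis_limits": {
  "model_memory_limit": 4096,
  "categorization_examples_limit": 4
}
----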
[float]
[[ml-apimodelplotconfig]]
===== Model Plot Config
This advanced configuration option will store model information along with results allowing a more detailed view into anomaly detection.
Enabling this can add considerable overhead to the performance of the system and is not feasible for jobs with many entities.
This advanced configuration option stores model information along with the
results. It provides a more detailed view into anomaly detection. If you enable
this option, it can add considerable overhead to the performance of the system;
it is not feasible for jobs with many entities.
Model plot provides a simplified and indicative view of the model and its bounds.
It does not display complex features such as multivariate correlations or multimodal data.
As such, anomalies may occassionally be reported which cannot be seen in the model plot.
Model plot provides a simplified and indicative view of the model and its bounds.
It does not display complex features such as multivariate correlations or multimodal data.
As such, anomalies may occasionally be reported which cannot be seen in the model plot.
Model plot config can be configured when the job is created or updated later. It must be disabled if performance issues are experienced.
The model plot configuration can be defined when the job is created or updated later.
It must be disabled if performance issues are experienced.
The `model_plot_config` object has the following properties:
`enabled`::
(boolean) If true, will enable calculation and storage of the model bounds for each entity being analyzed.
By default, this is not enabled.
(boolean) If true, enables calculation and storage of the model bounds for
each entity that is being analyzed. By default, this is not enabled.
`terms`::
(string) Limits data collection to this comma separated list of _partition_ or _by_ field names.
If terms are not specified or is an empty string, no filtering is applied.
E.g. `"CPU,NetworkIn,DiskWrites"`
(string) Limits data collection to this comma separated list of _partition_
or _by_ field names. If terms are not specified or it is an empty string,
no filtering is applied. For example, `"CPU,NetworkIn,DiskWrites"`.
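Finally, a sketch of a `model_plot_config` object using the example terms above (enable it only when the performance cost is acceptable):

[source,js]
----
"model_plot_config": {
  "enabled": true,
  "terms": "CPU,NetworkIn,DiskWrites"
}
----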