[role="xpack"]
[[ml-job-resource]]
=== Job Resources

A job resource has the following properties:

`analysis_config`::
(object) The analysis configuration, which specifies how to analyze the data.
See <<ml-analysisconfig, analysis configuration objects>>.

`analysis_limits`::
(object) Defines approximate limits on the memory resource requirements for the job.
See <<ml-apilimits,analysis limits>>.

`background_persist_interval`::
(time units) Advanced configuration option.
The time between each periodic persistence of the model.
The default value is a randomized value between 3 and 4 hours, which avoids
all jobs persisting at exactly the same time. The smallest allowed value is
1 hour.
+
--
TIP: For very large models (several GB), persistence could take 10-20 minutes,
so do not set the `background_persist_interval` value too low.

--

`create_time`::
(string) The time the job was created. For example, `1491007356077`. This
property is informational; you cannot change its value.

`custom_settings`::
(object) Advanced configuration option. Contains custom metadata about the
job. For example, it can contain custom URL information as shown in
{xpack-ref}/ml-configuring-url.html[Adding Custom URLs to Machine Learning Results].

`data_description`::
(object) Describes the data format and how APIs parse timestamp fields.
See <<ml-datadescription,data description objects>>.

`description`::
(string) An optional description of the job.

`established_model_memory`::
(long) The approximate amount of memory resources that have been used for
analytical processing. This field is present only when the analytics have used
a stable amount of memory for several consecutive buckets.

`finished_time`::
(string) If the job closed or failed, this is the time the job finished;
otherwise, it is `null`. This property is informational; you cannot change its
value.

`groups`::
(array of strings) A list of job groups. A job can belong to no groups or
many. For example, `["group1", "group2"]`.

`job_id`::
(string) The unique identifier for the job. This identifier can contain
lowercase alphanumeric characters (a-z and 0-9), hyphens, and underscores. It
must start and end with alphanumeric characters. This property is
informational; you cannot change the identifier for existing jobs.

`job_type`::
(string) Reserved for future use, currently set to `anomaly_detector`.

`job_version`::
(string) The version of {es} that existed on the node when the job was created.

`model_plot_config`::
(object) Configuration properties for storing additional model information.
See <<ml-apimodelplotconfig, model plot configuration>>.

`model_snapshot_id`::
(string) A numerical character string that uniquely identifies the model
snapshot. For example, `1491007364`. This property is informational; you
cannot change its value. For more information about model snapshots, see
<<ml-snapshot-resource>>.

`model_snapshot_retention_days`::
(long) The time in days that model snapshots are retained for the job.
Older snapshots are deleted. The default value is `1`, which means snapshots
are retained for one day (twenty-four hours).

`renormalization_window_days`::
(long) Advanced configuration option.
The period over which adjustments to the score are applied, as new data is seen.
The default value is the longer of 30 days or 100 `bucket_spans`.

`results_index_name`::
(string) The name of the index in which to store the {ml} results.
The default value is `shared`,
which corresponds to the index name `.ml-anomalies-shared`.

`results_retention_days`::
(long) Advanced configuration option.
The number of days for which job results are retained.
Once per day at 00:30 (server time), results older than this period are
deleted from Elasticsearch. The default value is `null`, which means results
are retained.

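
As a rough illustration of how the configurable properties fit together, the
following sketch validates a small job configuration by using the `_validate`
endpoint that appears later on this page. The field names (`responsetime`,
`airline`, `timestamp`), the group name, and the chosen values are hypothetical
and are not defaults or recommendations. Informational properties such as
`create_time` and `job_version` are omitted because you cannot set them.

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "description": "Response time by airline", <1>
  "groups": ["group1"],
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime", <2>
      "by_field_name": "airline"
    }]
  },
  "data_description": {
    "time_field": "timestamp"
  },
  "model_snapshot_retention_days": 1
}
--------------------------------------------------
// NOTCONSOLE
<1> Hypothetical description; any string can be used.
<2> Hypothetical field names; replace them with fields from your own data.
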
[[ml-analysisconfig]]
==== Analysis Configuration Objects

An analysis configuration object has the following properties:

`bucket_span`::
(time units) The size of the interval that the analysis is aggregated into,
typically between `5m` and `1h`. The default value is `5m`.

`categorization_field_name`::
(string) If this property is specified, the values of the specified field will
be categorized. The resulting categories must be used in a detector by setting
`by_field_name`, `over_field_name`, or `partition_field_name` to the keyword
`mlcategory`. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].

`categorization_filters`::
(array of strings) If `categorization_field_name` is specified,
you can also define optional filters. This property expects an array of
regular expressions. The expressions are used to filter out matching sequences
from the categorization field values. You can use this functionality to
fine-tune the categorization by excluding sequences from consideration when
categories are defined. For example, you can exclude SQL statements that
appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters that are applied
prior to tokenization, setting this property is the easiest method.
If you also want to customize the tokenizer or post-tokenization filtering,
use the `categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters. The effect is exactly the same.

`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified, you can also
define the analyzer that is used to interpret the categorization field. This
property cannot be used at the same time as `categorization_filters`. See
<<ml-categorizationanalyzer,categorization analyzer>>.

`detectors`::
(array) An array of detector configuration objects,
which describe the anomaly detectors that are used in the job.
See <<ml-detectorconfig,detector configuration objects>>.
+
--
NOTE: If the `detectors` array does not contain at least one detector,
no analysis can occur and an error is returned.

--

`influencers`::
(array of strings) A comma-separated list of influencer field names.
Typically these can be the by, over, or partition fields that are used in the
detector configuration. You might also want to use a field name that is not
specifically named in a detector, but is available as part of the input data.
When you use multiple detectors, the use of influencers is recommended as it
aggregates results for each influencer entity.

`latency`::
(time units) The size of the window in which to expect data that is out of
time order. The default value is `0` (no latency). If you specify a non-zero
value, it must be greater than or equal to one second. For more information
about time units, see
{ref}/common-options.html#time-units[Time Units].
+
--
NOTE: Latency is only applicable when you send data by using
the <<ml-post-data,post data>> API.

--

`multivariate_by_fields`::
(boolean) This functionality is reserved for internal use. It is not supported
for use in customer environments and is not subject to the support SLA of
official GA features.
+
--
If set to `true`, the analysis will automatically find correlations
between metrics for a given `by` field value and report anomalies when those
correlations cease to hold. For example, suppose CPU and memory usage on host A
are usually highly correlated with the same metrics on host B. Perhaps this
correlation occurs because they are running a load-balanced application.
If you enable this property, then anomalies will be reported when, for example,
CPU usage on host A is high and the value of CPU usage on host B is low.
That is to say, you'll see an anomaly when the CPU of host A is unusual given
the CPU of host B.

NOTE: To use the `multivariate_by_fields` property, you must also specify
`by_field_name` in your detector.

--

`summary_count_field_name`::
(string) If this property is specified, the data that is fed to the job is
expected to be pre-summarized. This property value is the name of the field
that contains the count of raw data points that have been summarized. The same
`summary_count_field_name` applies to all detectors in the job.
+
--

NOTE: The `summary_count_field_name` property cannot be used with the `metric`
function.

--

After you create a job, you cannot change the analysis configuration object; all
the properties are informational.

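
The following sketch shows one way an `analysis_config` object might combine
`bucket_span`, two detectors, and `influencers`. Note that `influencers` is
written as a JSON array of strings. The field names `responsetime` and
`airline` and the `10m` bucket span are hypothetical; they are assumptions for
illustration only.

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config": {
    "bucket_span": "10m", <1>
    "detectors": [
      { "function": "count" },
      {
        "function": "mean",
        "field_name": "responsetime",
        "by_field_name": "airline"
      }
    ],
    "influencers": ["airline"] <2>
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> Hypothetical value within the typical `5m` to `1h` range.
<2> Hypothetical influencer; here it reuses the by field of the second detector.
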
[float]
[[ml-detectorconfig]]
==== Detector Configuration Objects

Detector configuration objects specify which data fields a job analyzes.
They also specify which analytical functions are used.
You can specify multiple detectors for a job.
Each detector has the following properties:

`by_field_name`::
(string) The field used to split the data.
In particular, this property is used for analyzing the splits with respect to their own history.
It is used for finding unusual values in the context of the split.

`detector_description`::
(string) A description of the detector. For example, `Low event rate`.

`detector_index`::
(integer) A unique identifier for the detector. This identifier is based on
the order of the detectors in the `analysis_config`, starting at zero. You can
use this identifier when you want to update a specific detector.

`exclude_frequent`::
(string) Contains one of the following values: `all`, `none`, `by`, or `over`.
If set, frequent entities are excluded from influencing the anomaly results.
Entities can be considered frequent over time or frequent in a population.
If you are working with both over and by fields, then you can set `exclude_frequent`
to `all` for both fields, or to `by` or `over` for those specific fields.

`field_name`::
(string) The field that the detector uses in the function. If you use an event rate
function such as `count` or `rare`, do not specify this field.
+
--
NOTE: The `field_name` cannot contain double quotes or backslashes.

--

`function`::
(string) The analysis function that is used.
For example, `count`, `rare`, `mean`, `min`, `max`, and `sum`. For more
information, see {xpack-ref}/ml-functions.html[Function Reference].

`over_field_name`::
(string) The field used to split the data.
In particular, this property is used for analyzing the splits with respect to
the history of all splits. It is used for finding unusual values in the
population of all splits. For more information, see
{xpack-ref}/ml-configuring-pop.html[Performing Population Analysis].

`partition_field_name`::
(string) The field used to segment the analysis.
When you use this property, you have completely independent baselines for each value of this field.

`use_null`::
(boolean) Defines whether a new series is used as the null series
when there is no value for the by or partition fields. The default value is `false`.
+
--
IMPORTANT: Field names are case sensitive. For example, a field named 'Bytes'
is different from one named 'bytes'.

--

After you create a job, the only property you can change in the detector
configuration object is the `detector_description`; all other properties are
informational.

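
As a minimal sketch of how these detector properties can be combined, the
following request validates a job with two detectors: a population detector
that uses `over_field_name` with `exclude_frequent`, and a partitioned detector
that sets `use_null`. All field names (`bytes`, `clientip`, `responsetime`,
`host`) and descriptions are hypothetical, and the combination is only one
plausible arrangement.

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config": {
    "detectors": [
      {
        "detector_description": "Unusual bytes per client", <1>
        "function": "sum",
        "field_name": "bytes",
        "over_field_name": "clientip",
        "exclude_frequent": "over"
      },
      {
        "detector_description": "Response time per host",
        "function": "mean",
        "field_name": "responsetime",
        "partition_field_name": "host",
        "use_null": true
      }
    ]
  },
  "data_description": {
    "time_field": "timestamp"
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> Hypothetical description and field names. This detector would have
`detector_index` 0 because it is first in the array.
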
[float]
[[ml-datadescription]]
==== Data Description Objects

The data description defines the format of the input data when you send data to
the job by using the <<ml-post-data,post data>> API. Note that when you configure
a {dfeed}, these properties are automatically set.

When data is received via the <<ml-post-data,post data>> API, it is not stored
in {es}. Only the results for anomaly detection are retained.

A data description object has the following properties:

`format`::
(string) Only `JSON` format is supported at this time.

`time_field`::
(string) The name of the field that contains the timestamp.
The default value is `time`.

`time_format`::
(string) The time format, which can be `epoch`, `epoch_ms`, or a custom pattern.
The default value is `epoch`, which refers to UNIX or Epoch time (the number of seconds
since 1 Jan 1970).
The value `epoch_ms` indicates that time is measured in milliseconds since the epoch.
The `epoch` and `epoch_ms` time formats accept either integer or real values.
+
--
NOTE: Custom patterns must conform to the Java `DateTimeFormatter` class.
When you use date-time formatting patterns, it is recommended that you provide
the full date, time and time zone. For example: `yyyy-MM-dd'T'HH:mm:ssX`.
If the pattern that you specify is not sufficient to produce a complete timestamp,
job creation fails.

--

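
For illustration, a data description that uses the custom pattern mentioned in
the note might look like the following sketch. The `timestamp` field name and
the detector are hypothetical; only the `time_format` pattern is taken from the
text above.

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config": {
    "detectors": [{ "function": "count" }]
  },
  "data_description": {
    "time_field": "timestamp", <1>
    "time_format": "yyyy-MM-dd'T'HH:mm:ssX" <2>
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> Hypothetical field name. If you omit `time_field`, the default is `time`.
<2> A pattern that provides the full date, time, and time zone.
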
[float]
[[ml-categorizationanalyzer]]
==== Categorization Analyzer

The categorization analyzer specifies how the field named in
`categorization_field_name` is interpreted by the categorization process. The
syntax is very similar to that used to define the `analyzer` in the
<<indices-analyze,Analyze endpoint>>.

The `categorization_analyzer` field can be specified either as a string or as
an object.

If it is a string, it must refer to a <<analysis-analyzers,built-in analyzer>> or
one added by another plugin.

If it is an object, it has the following properties:

`char_filter`::
(array of strings or objects) One or more
<<analysis-charfilters,character filters>>. In addition to the built-in
character filters, other plugins can provide more character filters. This
property is optional. If it is not specified, no character filters are applied
prior to categorization. If you are customizing some other aspect of the
analyzer and you need to achieve the equivalent of `categorization_filters`
(which are not permitted when some other aspect of the analyzer is customized),
add them here as
<<analysis-pattern-replace-charfilter,pattern replace character filters>>.

`tokenizer`::
(string or object) The name or definition of the
<<analysis-tokenizers,tokenizer>> to use after character filters are applied.
This property is compulsory if `categorization_analyzer` is specified as an
object. Machine learning provides a tokenizer called `ml_classic` that
tokenizes in the same way as the non-customizable tokenizer in older versions
of the product. If you want to use that tokenizer but change the character or
token filters, specify `"tokenizer": "ml_classic"` in your
`categorization_analyzer`.

`filter`::
(array of strings or objects) One or more
<<analysis-tokenfilters,token filters>>. In addition to the built-in token
filters, other plugins can provide more token filters. This property is
optional. If it is not specified, no token filters are applied prior to
categorization.

If you omit the `categorization_analyzer`, the following default values are used:

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// CONSOLE

If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
words in the stop token filter to the appropriate words in your language. If you
are categorizing messages in a language where words are not separated by spaces,
you must use a different tokenizer as well in order to get sensible
categorization results.

It is important to be aware that analyzing for categorization of machine
generated log messages is a little different from tokenizing for search.
Features that work well for search, such as stemming, synonym substitution, and
lowercasing, are likely to make the results of categorization worse. However, in
order for drill down from {ml} results to work correctly, the tokens that the
categorization analyzer produces must be similar to those produced by the search
analyzer. If they are sufficiently similar, when you search for the tokens that
the categorization analyzer produces, you find the original document that
the categorization field value came from.

For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].

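
As a sketch of the `char_filter` approach described above, the following
request expresses a filter that would otherwise go in `categorization_filters`
as a `pattern_replace` character filter, while keeping the `ml_classic`
tokenizer. The regular expression and the `message` field name are
hypothetical. Because this example specifies part of the analyzer and omits
`filter`, no token filters are applied; the default stop words are not added
automatically.

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config": {
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic"
    },
    "detectors": [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description": {
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> Hypothetical pattern that removes bracketed SQL statement text before
tokenization. Substitute a pattern that matches your own log format.
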
[float]
[[ml-apilimits]]
==== Analysis Limits

Limits can be applied for the resources required to hold the mathematical models in memory.
These limits are approximate and can be set per job. They do not control the
memory used by other processes, for example the Elasticsearch Java processes.
If necessary, you can increase the limits after the job is created.

The `analysis_limits` object has the following properties:

`categorization_examples_limit`::
(long) The maximum number of examples stored per category in memory and
in the results data store. The default value is `4`. If you increase this value,
more examples are available, but more storage is required.
If you set this value to `0`, no examples are stored.
+
--
NOTE: The `categorization_examples_limit` only applies to analysis that uses categorization.
For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].

--

`model_memory_limit`::
(long or string) The approximate maximum amount of memory resources that are
required for analytical processing. Once this limit is approached, data pruning
becomes more aggressive. Upon exceeding this limit, new entities are not
modeled. The default value for jobs created in version 6.1 and later is `1024mb`.
This value will need to be increased for jobs that are expected to analyze high
cardinality fields, but the default is set to a relatively small size to ensure
that high resource usage is a conscious decision. The default value for jobs
created in versions earlier than 6.1 is `4096mb`.
+
--
If you specify a number instead of a string, the units are assumed to be MiB.
Specifying a string is recommended for clarity. If you specify a byte size unit
of `b` or `kb` and the number does not equate to a discrete number of megabytes,
it is rounded down to the closest MiB. The minimum valid value is 1 MiB. If you
specify a value less than 1 MiB, an error occurs. For more information about
supported byte size units, see
{ref}/common-options.html#byte-units[Byte size units].

If your `elasticsearch.yml` file contains an `xpack.ml.max_model_memory_limit`
setting, an error occurs when you try to create jobs that have
`model_memory_limit` values greater than that setting. For more information,
see <<ml-settings>>.
--

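
A minimal sketch of an `analysis_limits` object follows. It uses the
recommended string form for `model_memory_limit`. The `2048mb` value, the
detector, and the field names are hypothetical; an appropriate limit depends on
the cardinality of your data and on any `xpack.ml.max_model_memory_limit`
setting.

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config": {
    "detectors": [{
      "function": "count",
      "by_field_name": "clientip" <1>
    }]
  },
  "data_description": {
    "time_field": "timestamp"
  },
  "analysis_limits": {
    "model_memory_limit": "2048mb" <2>
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> Hypothetical high cardinality field.
<2> Hypothetical value, written as a string with an explicit unit. A bare
number such as `2048` would be interpreted as MiB.
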
[float]
[[ml-apimodelplotconfig]]
==== Model Plot Config

This advanced configuration option stores model information along with the
results. It provides a more detailed view into anomaly detection.

WARNING: Enabling model plot can add considerable overhead to the performance
of the system; it is not feasible for jobs with many entities.

Model plot provides a simplified and indicative view of the model and its bounds.
It does not display complex features such as multivariate correlations or multimodal data.
As such, anomalies may occasionally be reported which cannot be seen in the model plot.

Model plot config can be configured when the job is created or updated later. It must be
disabled if performance issues are experienced.

The `model_plot_config` object has the following properties:

`enabled`::
(boolean) If true, enables calculation and storage of the model bounds for
each entity that is being analyzed. By default, this is not enabled.

`terms`::
experimental[] (string) Limits data collection to this comma-separated list of
partition or by field values. If terms are not specified or it is an empty
string, no filtering is applied. For example, `"CPU,NetworkIn,DiskWrites"`.
Wildcards are not supported. Only the specified `terms` can be viewed when
using the Single Metric Viewer.

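
For example, a job that enables model plot for only a few by field values might
be configured as in the following sketch. The `terms` string is the example
from the description above; the detector and the `value`, `metric_name`, and
`timestamp` field names are hypothetical.

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config": {
    "detectors": [{
      "function": "mean",
      "field_name": "value",
      "by_field_name": "metric_name" <1>
    }]
  },
  "data_description": {
    "time_field": "timestamp"
  },
  "model_plot_config": {
    "enabled": true,
    "terms": "CPU,NetworkIn,DiskWrites" <2>
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> Hypothetical by field whose values include the listed terms.
<2> A single comma-separated string, not an array. Wildcards are not supported.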