OpenSearch/docs/en/ml/limitations.asciidoc

[[ml-limitations]]
== Machine Learning Limitations

The following limitations and known problems apply to the {version} release of
{xpack}:

[float]
=== Categorization uses English dictionary words
//See x-pack-elasticsearch/#3021
Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages you must define a different
`categorization_analyzer` for your job. For more information, see
<<ml-configuring-categories>>.

Additionally, a dictionary used to influence the categorization process contains
only English words. This means categorization might work better in English than
in other languages. The ability to customize the dictionary will be added in a
future release.

[float]
=== Pop-ups must be enabled in browsers
//See x-pack-elasticsearch/#844

The {xpackml} features in {kib} use pop-ups. You must configure your
web browser so that it does not block pop-up windows or create an
exception for your {kib} URL.

[float]
=== {xpackml} features do not yet support cross cluster search

At this time, you cannot use cross cluster search in either the {ml} APIs or the
{ml} features in {kib}.

For more information about cross cluster search,
see {ref}/modules-cross-cluster-search.html[Cross Cluster Search].

[float]
=== Anomaly Explorer omissions and limitations
//See x-pack-elasticsearch/#844 and x-pack-kibana/#1461

In {kib}, Anomaly Explorer charts are not displayed for anomalies
that were due to categorization, `time_of_day` functions, or `time_of_week`
functions. Those particular results do not display well as time series
charts.

The charts are also not displayed for detectors that use script fields. In that
case, the original source data cannot be easily searched because it has been
somewhat transformed by the script.

The Anomaly Explorer charts can also look odd in circumstances where there
is very little data to plot. For example, if there is only one data point, it is
represented as a single dot. If there are only two data points, they are joined
by a line.

[float]
=== Jobs close on the {dfeed} end date
//See x-pack-elasticsearch/#1037

If you start a {dfeed} and specify an end date, it will close the job when
the {dfeed} stops. This behavior avoids having numerous open one-time jobs.

If you do not specify an end date when you start a {dfeed}, the job
remains open when you stop the {dfeed}. This behavior avoids the overhead
of closing and re-opening large jobs when there are pauses in the {dfeed}.

[float]
=== Post data API requires JSON format

The post data API enables you to send data to a job for analysis. The data that
you send to the job must use the JSON format.

For more information about this API, see
{ref}/ml-post-data.html[Post Data to Jobs].


[float]
=== Misleading high missing field counts
//See x-pack-elasticsearch/#684

One of the counts associated with a {ml} job is `missing_field_count`,
which indicates the number of records that are missing a configured field.
//This information is most useful when your job analyzes CSV data.  In this case,
//missing fields indicate data is not being analyzed and you might receive poor results.

Since jobs analyze JSON data, the `missing_field_count` might be misleading.
Missing fields might be expected due to the structure of the data and therefore
do not generate poor results.

For more information about `missing_field_count`,
see {ref}/ml-jobstats.html#ml-datacounts[Data Counts Objects].


[float]
=== Terms aggregation size affects data analysis
//See x-pack-elasticsearch/#601

By default, the `terms` aggregation returns the buckets for the top ten terms.
You can change this default behavior by setting the `size` parameter.

If you are send pre-aggregated data to a job for analysis, you must ensure
that the `size` is configured correctly. Otherwise, some data might not be
analyzed.


[float]
=== Time-based index patterns are not supported
//See x-pack-elasticsearch/#1910

It is not possible to create an {xpackml} analysis job that uses time-based
index patterns, for example `[logstash-]YYYY.MM.DD`.
This applies to the single metric or multi metric job creation wizards in {kib}.


[float]
=== Fields named "by", "count", or "over" cannot be used to split data
//See x-pack-elasticsearch/#858

You cannot use the following field names in the `by_field_name` or
`over_field_name` properties in a job: `by`; `count`; `over`. This limitation
also applies to those properties when you create advanced jobs in {kib}.


[float]
=== Jobs created in {kib} use model plot config and pre-aggregated data
//See x-pack-elasticsearch/#844

If you create single or multi-metric jobs in {kib}, it might enable some
options under the covers that you'd want to reconsider for large or
long-running jobs.

For example, when you create a single metric job in {kib}, it generally
enables the `model_plot_config` advanced configuration option. That configuration
option causes model information to be stored along with the results and provides
a more detailed view into anomaly detection. It is specifically used by the
**Single Metric Viewer** in {kib}. When this option is enabled, however, it can
add considerable overhead to the performance of the system. If you have jobs
with many entities, for example data from tens of thousands of servers, storing
this additional model information for every bucket might be problematic. If you
are not certain that you need this option or if you experience performance
issues, edit your job configuration to disable this option.

For more information, see
{ref}/ml-job-resource.html#ml-apimodelplotconfig[Model Plot Config].

Likewise, when you create a single or multi-metric job in {kib}, in some cases
it uses aggregations on the data that it retrieves from {es}. One of the
benefits of summarizing data this way is that {es} automatically distributes
these calculations across your cluster. This summarized data is then fed into
{xpackml} instead of raw results, which reduces the volume of data that must
be considered while detecting anomalies.  However, if you have two jobs, one of
which uses pre-aggregated data and another that does not, their results might
differ. This difference is due to the difference in precision of the input data.
The {ml} analytics are designed to be aggregation-aware and the likely increase
in performance that is gained by pre-aggregating the data makes the potentially
poorer precision worthwhile. If you want to view or change the aggregations
that are used in your job, refer to the `aggregations` property in your {dfeed}.

For more information, see {ref}/ml-datafeed-resource.html[Datafeed Resources].

[float]
=== Security Integration

When {security} is enabled, a {dfeed} stores the roles of the user who created
or updated the {dfeed} **at that time**. This means that if those roles are
updated then the {dfeed} subsequently runs with the new permissions that are
associated with the roles. However, if the user's roles are adjusted after
creating or updating the {dfeed}, the {dfeed} continues to run with the
permissions that were associated with the original roles. For more information,
see <<ml-dfeeds>>.

[float]
=== Forecasts cannot be created for population jobs

If you use an `over_field_name` property in your job (that is to say, it's a
_population job_), you cannot create a forecast. If you try to create a forecast
for this type of job, an error occurs. For more information about forecasts,
see <<ml-forecasting>>.

[float]
=== Forecasts cannot be created for jobs that use geographic, rare, or time functions

If you use any of the following analytical functions in your job, you cannot
create a forecast:

* `lat_long`
* `rare` and `freq_rare`
* `time_of_day` and `time_of_week`

If you try to create a forecast for this type of job, an error occurs. For more
information about any of these functions, see <<ml-functions>>.