[DOCS] Edited documentation for ML categorization_analyzer (elastic/x-pack-elasticsearch#3587)
Original commit: elastic/x-pack-elasticsearch@6dd179107a
parent 60d4b7e53e, commit 9f6064f9ac
@@ -13,10 +13,6 @@ example:

You can use {ml} to observe the static parts of the message, cluster similar
messages together, and classify them into message categories.

The {ml} model learns what volume and pattern is normal for each category over
time. You can then detect anomalies and surface rare events or unusual types of
messages by using count or rare functions. For example:
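
For instance, the following is a minimal sketch of such a job; the job ID,
field names, and bucket span are illustrative placeholders, not values from the
original example:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_example
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message", <1>
    "detectors" : [{
      "function" : "count", <2>
      "by_field_name" : "mlcategory" <3>
    }]
  },
  "data_description" : {
    "time_field" : "time"
  }
}
----------------------------------
// CONSOLE
<1> The field whose values are categorized.
<2> The `count` function detects anomalous message volumes per category; `rare`
would instead surface categories that rarely occur.
<3> `mlcategory` is the keyword that makes the detector operate on the
categories rather than on a regular field.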

@@ -79,8 +75,17 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options relat

NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.

[float]
[[ml-configuring-analyzer]]
==== Customizing the Categorization Analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.

You can, however, change the tokenization rules by customizing the way the
categorization field values are interpreted. For example:

@@ -126,20 +131,20 @@ PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2

here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property value.
<3> By default, English day or month words are filtered from log messages before
categorization. If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your
language.

The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in {es} analyzer or a combination of zero or more character filters,
a tokenizer, and zero or more token filters.

The `ml_classic` tokenizer and the day and month stopword filter are more or less
equivalent to the following analyzer, which is defined using only built-in {es}
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
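
The verbatim example is elided from this excerpt. The following sketch
approximates its shape; the exact patterns and the shortened stopword list are
assumptions, not the original values, and the numbered callouts are explained
below:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_sketch
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter" : [
        { "type" : "pattern_replace", "pattern" : "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern" : "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern" : "^[_.-]+" }, <4>
        { "type" : "pattern_replace", "pattern" : "[_.-]+$" }, <5>
        { "type" : "stop", "stopwords" : [ "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday", "Sunday", "Jan", "Feb", "Mar",
          "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" ] }
      ]
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "time"
  }
}
----------------------------------
// CONSOLE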

@@ -188,23 +193,30 @@ PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3

<1> Tokens basically consist of hyphens, digits, letters, underscores, and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this example
analyzer is that using the `ml_classic` tokenizer is several times faster. The
difference in behavior is that this custom analyzer does not include accented
letters in tokens whereas the `ml_classic` tokenizer does, although that could
be fixed by using more complex regular expressions.
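
You can observe this difference with a quick experiment against the
{ref}/indices-analyze.html[analyze API]; this is a sketch and the message text
is invented:

[source,js]
----------------------------------
POST _analyze
{
  "tokenizer" : {
    "type" : "simple_pattern_split",
    "pattern" : "[^-0-9A-Za-z_.]+" <1>
  },
  "text" : "utilisateur déconnecté du node-1"
}
----------------------------------
// CONSOLE
<1> The same pattern as in the sketch above. Because the character class
contains only unaccented letters, `déconnecté` is broken into the fragments `d`
and `connect`, whereas the `ml_classic` tokenizer keeps it as one token.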

For more information about the `categorization_analyzer` property, see
{ref}/ml-job-resource.html#ml-categorizationanalyzer[Categorization Analyzer].

NOTE: To add the `categorization_analyzer` property in {kib}, you must use the
**Edit JSON** tab and copy the `categorization_analyzer` object from one of the
API examples above.

[float]
[[ml-viewing-categories]]
==== Viewing Categorization Results

After you open the job and start the {dfeed} or supply data to the job, you can
view the categorization results in {kib}. For example:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization example in the Anomaly Explorer"]

@@ -10,10 +10,13 @@ The following limitations and known problems apply to the {version} release of

Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages, you must define a different
`categorization_analyzer` for your job. For more information, see
<<ml-configuring-categories>>.

Additionally, a dictionary used to influence the categorization process contains
only English words. This means categorization might work better in English than
in other languages. The ability to customize the dictionary will be added in a
future release.

[float]
=== Pop-ups must be enabled in browsers

@@ -105,24 +105,23 @@ An analysis configuration object has the following properties:

(array of strings) If `categorization_field_name` is specified,
you can also define optional filters. This property expects an array of
regular expressions. The expressions are used to filter out matching sequences
from the categorization field values. You can use this functionality to fine
tune the categorization by excluding sequences from consideration when
categories are defined. For example, you can exclude SQL statements that
appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters that are applied
prior to tokenization, setting this property is the easiest method.
If you also want to customize the tokenizer or post-tokenization filtering,
use the `categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters. The effect is exactly the same.
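
As a sketch of that equivalence (the job IDs and the regular expression are
illustrative), the two configurations below filter the categorization field
values identically. Note that the second job must also name a tokenizer, here
`ml_classic`, because a customized analyzer receives no defaults:

[source,js]
--------------------------------------------------
PUT _xpack/ml/anomaly_detectors/example_filters
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "categorization_filters" : [ "\\[statement:.*\\]" ], <1>
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : { "time_field" : "time" }
}

PUT _xpack/ml/anomaly_detectors/example_char_filter
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "char_filter" : [
        { "type" : "pattern_replace", "pattern" : "\\[statement:.*\\]" } <2>
      ],
      "tokenizer" : "ml_classic"
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : { "time_field" : "time" }
}
--------------------------------------------------
// CONSOLE
<1> A simple filter that strips SQL statement fragments before tokenization.
<2> The same expression written as a `pattern_replace` character filter inside
`categorization_analyzer`.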

`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified, you can also
define the analyzer that is used to interpret the categorization field. This
property cannot be used at the same time as `categorization_filters`. See
<<ml-categorizationanalyzer,categorization analyzer>>.

`detectors`::
(array) An array of detector configuration objects,

@@ -316,39 +315,40 @@ used to define the `analyzer` in the <<indices-analyze,Analyze endpoint>>.

The `categorization_analyzer` field can be specified either as a string or as
an object.

If it is a string, it must refer to a <<analysis-analyzers,built-in analyzer>> or
one added by another plugin.
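
For example, the string form can name a built-in analyzer directly; this is a
sketch and the job ID is a placeholder:

[source,js]
--------------------------------------------------
PUT _xpack/ml/anomaly_detectors/example_string_analyzer
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "categorization_analyzer" : "standard", <1>
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : { "time_field" : "time" }
}
--------------------------------------------------
// CONSOLE
<1> The built-in `standard` analyzer, referenced by name. As noted below,
analyzers that work well for search are not necessarily good choices for
categorization.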

If it is an object, it has the following properties:

`char_filter`::
(array of strings or objects) One or more
<<analysis-charfilters,character filters>>. In addition to the built-in
character filters, other plugins can provide more character filters. This
property is optional. If it is not specified, no character filters are applied
prior to categorization. If you are customizing some other aspect of the
analyzer and you need to achieve the equivalent of `categorization_filters`
(which are not permitted when some other aspect of the analyzer is customized),
add them here as
<<analysis-pattern-replace-charfilter,pattern replace character filters>>.

`tokenizer`::
(string or object) The name or definition of the
<<analysis-tokenizers,tokenizer>> to use after character filters are applied.
This property is compulsory if `categorization_analyzer` is specified as an
object. Machine learning provides a tokenizer called `ml_classic` that
tokenizes in the same way as the non-customizable tokenizer in older versions
of the product. If you want to use that tokenizer but change the character or
token filters, specify `"tokenizer": "ml_classic"` in your
`categorization_analyzer` (see the sketch after these property descriptions).

`filter`::
(array of strings or objects) One or more
<<analysis-tokenfilters,token filters>>. In addition to the built-in token
filters, other plugins can provide more token filters. This property is
optional. If it is not specified, no token filters are applied prior to
categorization.
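
Putting these properties together, here is a sketch of an object-form
`categorization_analyzer` that keeps the classic tokenization but swaps in a
custom token filter (the job ID and the stopword list are illustrative):

[source,js]
--------------------------------------------------
PUT _xpack/ml/anomaly_detectors/example_object_analyzer
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic", <1>
      "filter" : [
        { "type" : "stop", "stopwords" : [ "DEBUG", "INFO", "WARN", "ERROR" ] } <2>
      ]
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : { "time_field" : "time" }
}
--------------------------------------------------
// CONSOLE
<1> Keeps the classic tokenization behavior.
<2> An illustrative stop filter that drops log-level tokens. Because `filter`
is specified explicitly, the default day and month stopword filter is not
applied, as described below.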

If you omit the `categorization_analyzer`, the following default values are used:

@@ -379,27 +379,28 @@ POST _xpack/ml/anomaly_detectors/_validate

If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.
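
For instance, this sketch validates a job whose `categorization_analyzer` sets
only a tokenizer; no character filters and no token filters are applied, not
even the default day and month stopwords:

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "whitespace" <1>
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : { "time_field" : "time" }
}
--------------------------------------------------
// CONSOLE
<1> Only the built-in `whitespace` tokenizer runs; the omitted `char_filter`
and `filter` sub-properties stay empty rather than falling back to defaults.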

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
words in the stop token filter to the appropriate words in your language. If you
are categorizing messages in a language where words are not separated by spaces,
you must use a different tokenizer as well in order to get sensible
categorization results.
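
As a sketch, a German-language job might replace the English day and month
words like this (the stopword list is abbreviated and illustrative):

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords" : [
          "Montag", "Dienstag", "Mittwoch", "Donnerstag",
          "Freitag", "Samstag", "Sonntag",
          "Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
          "August", "September", "Oktober", "November", "Dezember"
        ] } <1>
      ]
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : { "time_field" : "time" }
}
--------------------------------------------------
// CONSOLE
<1> German day and month words replace the default English ones.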

It is important to be aware that analyzing for categorization of machine-generated
log messages is a little different from tokenizing for search. Features that work
well for search, such as stemming, synonym substitution, and lowercasing are
likely to make the results of categorization worse. However, in order for drill
down from {ml} results to work correctly, the tokens that the categorization
analyzer produces must be similar to those produced by the search analyzer. If
they are sufficiently similar, when you search for the tokens that the
categorization analyzer produces, you find the original document that the
categorization field value came from.

For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].

[float]
[[ml-apilimits]]