[DOCS] Adds text about data types to the categorization docs (#51145)

@@ -1,6 +1,28 @@
[role="xpack"]
[[ml-configuring-categories]]
=== Categorizing data

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. However, categorization
doesn't work equally well on different data types. It works best on
machine-written messages and application output, typically data that consists
of repeated elements, such as log messages written for the purpose of system
troubleshooting. Log categorization groups unstructured log messages into
categories; you can then use {anomaly-detect} to model and identify rare or
unusual counts of log message categories.
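
For example, the following {anomaly-job} sketch counts log messages by category
and flags unusual counts per category; `mlcategory` is the by-field that holds
the category assigned to each message. The job ID, field names, and bucket span
here are illustrative assumptions, not part of the original example:

[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/it_ops_categories <1>
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message", <2>
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",
        "detector_description": "Unusual message counts"
      }
    ]
  },
  "data_description": {
    "time_field": "time"
  }
}
--------------------------------------------------
<1> A hypothetical job ID; choose your own.
<2> The field whose values are categorized; `message` is an assumption about
your data.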

Categorization is tuned to work best on data like log messages by taking token
order into account, not considering synonyms, and including stop words in its
analysis. Complete sentences in human communication or literary text (for
example emails, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it will give poor results on such human-generated data. For example, the
categorization job would create so many categories that they couldn't be
handled effectively. Categorization is _not_ natural language processing (NLP).

[float]
[[ml-categorization-log-messages]]
==== Categorizing log messages

Application log events are often unstructured and contain variable data. For
example:

@@ -65,8 +87,8 @@
defining categories. The categorization filters are applied in the order they
are listed in the job configuration, which allows you to disregard multiple
sections of the categorization field value. In this example, we have decided
that we do not want the detailed SQL to be considered in the message
categorization. This categorization filter removes the SQL statement from the
input to the categorization algorithm.
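
For example, a filter like the following strips a bracketed SQL statement from
the field value before tokenization. The job ID and the regular expression are
illustrative sketches; match the expression to the shape of the SQL in your own
messages:

[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/it_ops_sql_filtered <1>
{
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[SQL: .*\\]" ], <2>
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory"
      }
    ]
  },
  "data_description": {
    "time_field": "time"
  }
}
--------------------------------------------------
<1> A hypothetical job ID.
<2> An illustrative regular expression that discards everything matching
`[SQL: ...]` before the message is categorized.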

If your data is stored in {es}, you can create an advanced {anomaly-job} with
these same properties:

@@ -79,7 +101,7 @@
NOTE: To add the `categorization_examples_limit` property, you must use the

[float]
[[ml-configuring-analyzer]]
===== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use

@@ -135,7 +157,8 @@
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property
value.
<3> By default, English day or month words are filtered from log messages before
categorization. If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your

@@ -178,9 +201,9 @@
POST _ml/anomaly_detectors/_validate

If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.
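
For example, in the following validation sketch (the field names and bucket
span are illustrative), only the tokenizer is specified, so no character or
token filters are applied at all; in particular, the default day and month
stopword filter is switched off:

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "tokenizer": "ml_classic" <1>
    },
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory"
      }
    ]
  },
  "data_description": {
    "time_field": "time"
  }
}
--------------------------------------------------
<1> Because `categorization_analyzer` is partially specified, only this
tokenizer runs; the omitted filter sub-properties do not fall back to their
defaults.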

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,console]

@@ -234,11 +257,11 @@
PUT _ml/anomaly_detectors/it_ops_new_logs3

<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this
example analyzer is speed: using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not
include accented letters in tokens, whereas the `ml_classic` tokenizer does,
although that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month

@@ -263,7 +286,7 @@
API examples above.

[float]
[[ml-viewing-categories]]
===== Viewing categorization results

After you open the job and start the {dfeed} or supply data to the job, you can
view the categorization results in {kib}. For example:
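
Outside {kib}, you can also retrieve the results with the get categories API
(`GET _ml/anomaly_detectors/<job_id>/results/categories`). A minimal sketch,
reusing the hypothetical job ID from the first example above:

[source,console]
--------------------------------------------------
GET _ml/anomaly_detectors/it_ops_categories/results/categories
{
  "page": {
    "size": 5 <1>
  }
}
--------------------------------------------------
<1> Returns only the first five categories; adjust the paging to your needs.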