[DOCS] Adds text about data types to the categorization docs (#51145)

István Zoltán Szabó 2020-01-17 18:52:57 +01:00 committed by lcawl
parent ccf3e443b5
commit 83c92cf7eb
1 changed file with 37 additions and 14 deletions


@@ -1,6 +1,28 @@
[role="xpack"] [role="xpack"]
[[ml-configuring-categories]] [[ml-configuring-categories]]
=== Categorizing log messages === Categorizing data
Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. However, categorization
doesn't work equally well on different data types. It works best on
machine-written messages and application output, typically data that consists
of repeated elements, such as log messages written for system troubleshooting.
Log categorization groups unstructured log messages into categories; you can
then use {anomaly-detect} to model and identify rare or unusual counts of log
message categories.
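
In practice, such a job pairs `categorization_field_name` with a `count`
detector split by the built-in `mlcategory` field. A minimal sketch of the
relevant `analysis_config` fragment (the field name and bucket span here are
illustrative assumptions, not values from these docs):

[source,js]
----
"analysis_config": {
  "bucket_span": "30m",
  "categorization_field_name": "message",
  "detectors": [{
    "function": "count",
    "by_field_name": "mlcategory"
  }]
}
----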
Categorization is tuned to work best on data like log messages: it takes token
order into account, includes stop words in its analysis, and does not consider
synonyms. Complete sentences in human communication or literary text (for
example, emails, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results on such human-generated data; for example, a
categorization job would create so many categories that they couldn't be
handled effectively. Categorization is _not_ natural language processing (NLP).

[float]
[[ml-categorization-log-messages]]
==== Categorizing log messages

Application log events are often unstructured and contain variable data. For
example:
@@ -65,8 +87,8 @@ defining categories. The categorization filters are applied in the order they
are listed in the job configuration, which allows you to disregard multiple
sections of the categorization field value. In this example, we have decided that
we do not want the detailed SQL to be considered in the message categorization.
This particular categorization filter removes the SQL statement from the
categorization algorithm.

If your data is stored in {es}, you can create an advanced {anomaly-job} with
these same properties:
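
The full request body falls outside this hunk; a minimal sketch of such an
advanced job, with a hypothetical job name and filter pattern, might look like:

[source,console]
----
PUT _ml/anomaly_detectors/it_ops_logs_example
{
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[SQL: .*\\]" ], <1>
    "detectors": [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "analysis_limits": {
    "categorization_examples_limit": 5 <2>
  },
  "data_description": {
    "time_field": "time"
  }
}
----
<1> A hypothetical filter that strips an embedded SQL statement before tokenization.
<2> The `categorization_examples_limit` property referred to in the note below.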
@@ -79,7 +101,7 @@ NOTE: To add the `categorization_examples_limit` property, you must use the

[float]
[[ml-configuring-analyzer]]
===== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
@@ -135,7 +157,8 @@ here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property
value.
<3> By default, English day or month words are filtered from log messages before
categorization. If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your
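
The configuration block these callouts annotate lies outside this hunk; a
sketch of a `categorization_analyzer` along those lines (the job name is
hypothetical, and the stopword list is abbreviated to weekdays for brevity)
could be:

[source,console]
----
PUT _ml/anomaly_detectors/it_ops_analyzer_example
{
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[SQL: .*\\]" }
      ],
      "tokenizer": "ml_classic",
      "filter": [
        { "type": "stop", "stopwords": [ "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday", "Sunday" ] }
      ]
    },
    "detectors": [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description": {
    "time_field": "time"
  }
}
----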
@@ -178,9 +201,9 @@ POST _ml/anomaly_detectors/_validate

If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,console]
@@ -234,11 +257,11 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
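
The analyzer these callouts annotate also falls outside this hunk; the trimming
that <4> and <5> describe can be expressed with `pattern_replace` token
filters, roughly as follows (the exact patterns are an illustrative guess, not
the ones in the docs):

[source,js]
----
"filter": [
  { "type": "pattern_replace", "pattern": "^[_\\-\\.]+", "replacement": "" },
  { "type": "pattern_replace", "pattern": "[_\\-\\.]+$", "replacement": "" }
]
----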

The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not include
accented letters in tokens whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
@@ -263,7 +286,7 @@ API examples above.

[float]
[[ml-viewing-categories]]
===== Viewing categorization results

After you open the job and start the {dfeed} or supply data to the job, you can
view the categorization results in {kib}. For example:
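
(The {kib} example that follows is outside this hunk.) Categorization results
can also be fetched directly with the get categories API; for example, for a
hypothetical job named `it_ops_logs_example`:

[source,console]
----
GET _ml/anomaly_detectors/it_ops_logs_example/results/categories
{
  "page": { "size": 10 }
}
----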