[DOCS] Edited documentation for ML categorization_analyzer (elastic/x-pack-elasticsearch#3587)

Original commit: elastic/x-pack-elasticsearch@6dd179107a
Lisa Cawley 2018-01-17 13:11:36 -08:00 committed by GitHub
parent 60d4b7e53e
commit 9f6064f9ac
3 changed files with 98 additions and 82 deletions


@@ -13,10 +13,6 @@ example:
You can use {ml} to observe the static parts of the message, cluster similar
messages together, and classify them into message categories.
NOTE: Categorization uses English tokenization rules and dictionary words in
order to identify log message categories. As such, only English language log
messages are supported.
The {ml} model learns what volume and pattern is normal for each category over
time. You can then detect anomalies and surface rare events or unusual types of
messages by using count or rare functions. For example:
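A minimal sketch of such a job (the job name, `bucket_span`, and field names
here are illustrative) pairs the `count` function with the built-in
`mlcategory` field:
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/example_log_categories <1>
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory",
      "detector_description" : "Unusual message counts"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
----------------------------------
//CONSOLE
<1> The job name and field names in this sketch are placeholders.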
@@ -79,8 +75,17 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options relat
NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.
It is possible to customize the way the categorization field values are interpreted
to an even greater extent:
[float]
[[ml-configuring-analyzer]]
==== Customizing the Categorization Analyzer
Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.
You can, however, change the tokenization rules by customizing the way the
categorization field values are interpreted. For example:
[source,js]
----------------------------------
@@ -126,20 +131,20 @@ PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. Use
it if you want the same categorization behavior as older versions.
<3> English day/month words are filtered by default from log messages
before categorization. If your logs are in a different language and contain
dates then you may get better results by filtering day/month words in your
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property value.
<3> By default, English day or month words are filtered from log messages before
categorization. If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your
language.
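For example, for French-language logs, a sketch (the stopword list and job
details are illustrative) swaps in French day and month names and can be
checked with the `_validate` endpoint:
[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords" : [
          "lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche",
          "janvier", "février", "mars", "avril", "mai", "juin",
          "juillet", "août", "septembre", "octobre", "novembre", "décembre"
        ] }
      ]
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
----------------------------------
//CONSOLE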
The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in Elasticsearch analyzer, or a combination of zero or more character
filters, a tokenizer, and zero or more token filters.
a built-in {es} analyzer or a combination of zero or more character filters,
a tokenizer, and zero or more token filters.
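For instance, a sketch (the regular expression and field names are
illustrative) that strips bracketed SQL statements with a `pattern_replace`
character filter before tokenizing with `ml_classic`:
[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "char_filter" : [
        { "type" : "pattern_replace", "pattern" : "\\[statement:.*\\]" } <1>
      ],
      "tokenizer" : "ml_classic"
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
----------------------------------
//CONSOLE
<1> An illustrative pattern; anything matching it is removed from the field
value before tokenization.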
The `ml_classic` tokenizer and the day/month stopword filter are more-or-less
equivalent to the following analyzer defined using only built-in Elasticsearch
The `ml_classic` tokenizer and the day and month stopword filter are more or less
equivalent to the following analyzer, which is defined using only built-in {es}
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
@@ -188,23 +193,30 @@ PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
----------------------------------
//CONSOLE
<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default categorization ignores tokens that begin with a digit.
<3> By default categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hypens and dots are removed from the beginning of tokens.
<5> Also at the end of tokens.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
The key difference between the default `categorization_analyzer` and this example
analyzer is that using the `ml_classic` tokenizer is several times faster. (The
analyzer is that using the `ml_classic` tokenizer is several times faster. The
difference in behavior is that this custom analyzer does not include accented
letters in tokens whereas the `ml_classic` tokenizer will, although that could be
fixed by using more complex regular expressions.)
letters in tokens whereas the `ml_classic` tokenizer does, although that could
be fixed by using more complex regular expressions.
NOTE: To add the `categorization_analyzer` property, you must use the **Edit JSON**
tab and copy the `categorization_analyzer` object from one of the API examples above.
For more information about the `categorization_analyzer` property, see
{ref}/ml-job-resource.html#ml-categorizationanalyzer[Categorization Analyzer].
NOTE: To add the `categorization_analyzer` property in {kib}, you must use the
**Edit JSON** tab and copy the `categorization_analyzer` object from one of the
API examples above.
[float]
[[ml-viewing-categories]]
==== Viewing Categorization Results
After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:
view the categorization results in {kib}. For example:
[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization example in the Anomaly Explorer"]


@@ -10,10 +10,13 @@ The following limitations and known problems apply to the {version} release of
Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages you must define a different
categorization_analyzer for your job. Additionally, a dictionary used to influence
the categorization process contains only English words. This means categorization
may work better in English than in other languages. The ability to customize the
dictionary will be added in a future release.
`categorization_analyzer` for your job. For more information, see
<<ml-configuring-categories>>.
Additionally, a dictionary used to influence the categorization process contains
only English words. This means categorization might work better in English than
in other languages. The ability to customize the dictionary will be added in a
future release.
[float]
=== Pop-ups must be enabled in browsers


@@ -105,24 +105,23 @@ An analysis configuration object has the following properties:
(array of strings) If `categorization_field_name` is specified,
you can also define optional filters. This property expects an array of
regular expressions. The expressions are used to filter out matching sequences
off the categorization field values. This functionality is useful to fine tune
categorization by excluding sequences that should not be taken into
consideration for defining categories. For example, you can exclude SQL
statements that appear in your log files. For more information, see
from the categorization field values. You can use this functionality to fine
tune the categorization by excluding sequences from consideration when
categories are defined. For example, you can exclude SQL statements that
appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters to be applied
prior to tokenization then it is easiest to specify them using this property.
If you also want to customize the tokenizer or post-tokenization filtering
then these filters must be included in the `categorization_analyzer` as
`pattern_replace` `char_filter`s. The effect is exactly the same.
//<<ml-configuring-categories>>.
If you only want to define simple regular expression filters that are applied
prior to tokenization, setting this property is the easiest method.
If you also want to customize the tokenizer or post-tokenization filtering,
use the `categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters. The effect is exactly the same.
`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified,
you can also define the analyzer that will be used to interpret the field
to be categorized. See <<ml-categorizationanalyzer,categorization analyzer>>.
//<<ml-configuring-categories>>.
(object or string) If `categorization_field_name` is specified, you can also
define the analyzer that is used to interpret the categorization field. This
property cannot be used at the same time as `categorization_filters`. See
<<ml-categorizationanalyzer,categorization analyzer>>.
`detectors`::
(array) An array of detector configuration objects,
@@ -316,39 +315,40 @@ used to define the `analyzer` in the <<indices-analyze,Analyze endpoint>>.
The `categorization_analyzer` field can be specified either as a string or as
an object.
If it is a string it must refer to a
{ref}/analysis-analyzers.html[built-in analyzer] or one added by
another plugin.
If it is a string it must refer to a <<analysis-analyzers,built-in analyzer>> or
one added by another plugin.
If it is an object it has the following properties:
`char_filter`::
(array of strings or objects) One or more
{ref}/analysis-charfilters.html[character filters]. In addition
to the built-in character filters other plugins may provide more. This property
is optional. If not specified then there will be no character filters. If
you are customizing some other aspect of the analyzer and need to achieve
the equivalent of `categorization_filters` (which are not permitted when some
other aspect of the analyzer is customized), add them here as
{ref}/analysis-pattern-replace-charfilter.html[pattern replace character filters].
<<analysis-charfilters,character filters>>. In addition to the built-in
character filters, other plugins can provide more character filters. This
property is optional. If it is not specified, no character filters are applied
prior to categorization. If you are customizing some other aspect of the
analyzer and you need to achieve the equivalent of `categorization_filters`
(which are not permitted when some other aspect of the analyzer is customized),
add them here as
<<analysis-pattern-replace-charfilter,pattern replace character filters>>.
`tokenizer`::
(string or object) The name or definition of the
{ref}/analysis-tokenizers.html[tokenizer] to use after character
filters have been applied. This property is compulsory if `categorization_analyzer`
is specified as an object. Machine learning provides a tokenizer called `ml_classic`
that tokenizes in the same way as the non-customizable tokenizer in older versions of
the product. If you would like to stick with this but change the character or token
filters then specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
<<analysis-tokenizers,tokenizer>> to use after character filters are applied.
This property is compulsory if `categorization_analyzer` is specified as an
object. Machine learning provides a tokenizer called `ml_classic` that
tokenizes in the same way as the non-customizable tokenizer in older versions
of the product. If you want to use that tokenizer but change the character or
token filters, specify `"tokenizer": "ml_classic"` in your
`categorization_analyzer`.
`filter`::
(array of strings or objects) One or more
{ref}/analysis-tokenfilters.html[token filters]. In addition to the built-in token
filters other plugins may provide more. This property is optional. If not specified
then there will be no token filters.
<<analysis-tokenfilters,token filters>>. In addition to the built-in token
filters, other plugins can provide more token filters. This property is
optional. If it is not specified, no token filters are applied prior to
categorization.
If you omit `categorization_analyzer` entirely then the default that will be used is
the one from the following job:
If you omit the `categorization_analyzer`, the following default values are used:
[source,js]
--------------------------------------------------
@@ -379,27 +379,28 @@ POST _xpack/ml/anomaly_detectors/_validate
--------------------------------------------------
// CONSOLE
However, if you specify any part of `categorization_analyzer` then any omitted
sub-properties are _not_ defaulted.
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.
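For example, in this sketch (job details illustrative) only the tokenizer is
specified, so no character filters or token filters are applied at all:
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic" <1>
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
--------------------------------------------------
// CONSOLE
<1> Because `char_filter` and `filter` are omitted, the default day and month
stopword filter is _not_ added back in.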
If you are categorizing non-English messages in a language where words are separated
by spaces you may get better results if you change the day/month words in the stop
token filter to those from your language. If you are categorizing messages in a language
where words are not separated by spaces then you will need to use a different tokenizer
as well in order to get sensible categorization results.
If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
words in the stop token filter to the appropriate words in your language. If you
are categorizing messages in a language where words are not separated by spaces,
you must use a different tokenizer as well in order to get sensible
categorization results.
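For instance, a sketch for Japanese log messages, assuming the
`analysis-kuromoji` plugin is installed (field names are illustrative):
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "kuromoji_tokenizer" <1>
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
--------------------------------------------------
// CONSOLE
<1> `kuromoji_tokenizer` is provided by the `analysis-kuromoji` plugin; it is
an assumption of this sketch, not part of the examples above.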
It is important to be aware that analyzing for categorization of machine generated
log messages is a little different to tokenizing for search. Features that work well
for search, such as stemming, synonym substitution and lowercasing are likely to make
the results of categorization worse. However, in order for drilldown from machine
learning results to work correctly, the tokens that the categorization analyzer
produces need to be sufficiently similar to those produced by the search analyzer
that if you search for the tokens that the categorization analyzer produces you will
find the original document that the field to be categorized came from.
For more information, see {xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
//<<ml-configuring-categories>>.
It is important to be aware that analyzing for categorization of machine
generated log messages is a little different from tokenizing for search.
Features that work well for search, such as stemming, synonym substitution, and
lowercasing are likely to make the results of categorization worse. However, in
order for drill down from {ml} results to work correctly, the tokens that the
categorization analyzer produces must be similar to those produced by the search
analyzer. If they are sufficiently similar, when you search for the tokens that
the categorization analyzer produces then you find the original document that
the categorization field value came from.
For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
[float]
[[ml-apilimits]]