[DOCS] Add documentation for ML categorization_analyzer (elastic/x-pack-elasticsearch#3554)

This is the documentation for the changes made in elastic/x-pack-elasticsearch#3372.

Relates elastic/machine-learning-cpp#491

Original commit: elastic/x-pack-elasticsearch@7d67e9d894
David Roberts 2018-01-15 15:47:19 +00:00 committed by GitHub
parent d4cddc12d0
commit e9dafbd78d
3 changed files with 238 additions and 3 deletions


@@ -79,6 +79,129 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options relat
NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.
It is possible to customize the way the categorization field values are interpreted
to an even greater extent:
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same effect as the `categorization_filters` in the first
example (see the snippet after these callouts).
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. Use
it if you want the same categorization behavior as older versions.
<3> English day and month words are filtered from log messages by default
before categorization. If your logs are in a different language and contain
dates, you may get better results by filtering the day and month words in your
language instead.
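For reference, callout <1> corresponds to a `categorization_filters` setting like
the following sketch (only the relevant part of `analysis_config` is shown, and it
assumes the same `message` field as above):
[source,js]
----------------------------------
"analysis_config" : {
  "categorization_field_name": "message",
  "categorization_filters": [ "\\[statement:.*\\]" ]
}
----------------------------------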
The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in Elasticsearch analyzer, or a combination of zero or more character
filters, a tokenizer, and zero or more token filters.
The `ml_classic` tokenizer and the day/month stopword filter are more-or-less
equivalent to the following analyzer defined using only built-in Elasticsearch
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> Tokens consist of hyphens, digits, letters, underscores and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens and dots are removed from the beginning of tokens...
<5> ...and also from the end of tokens.
The key difference between the default `categorization_analyzer` and this example
analyzer is that using the `ml_classic` tokenizer is several times faster. (The
difference in behavior is that this custom analyzer does not include accented
letters in tokens whereas the `ml_classic` tokenizer will, although that could be
fixed by using more complex regular expressions.)
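For example, one way to include Latin-1 accented letters would be to widen the
character classes in the tokenizer pattern, as in the following sketch (the exact
ranges to include depend on your data, and the `pattern_replace` filters in the
example above would need the same change):
[source,js]
----------------------------------
"tokenizer": {
  "type" : "simple_pattern_split",
  "pattern" : "[^-0-9A-Za-zÀ-ÖØ-öø-ÿ_.]+"
}
----------------------------------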
NOTE: To add the `categorization_analyzer` property, you must use the **Edit JSON**
tab and copy the `categorization_analyzer` object from one of the API examples above.
After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:


@@ -5,11 +5,15 @@ The following limitations and known problems apply to the {version} release of
{xpack}:
[float]
=== Categorization uses English dictionary words
//See x-pack-elasticsearch/#3021
Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages you must define a different
`categorization_analyzer` for your job. Additionally, a dictionary used to
influence the categorization process contains only English words. This means
categorization may work better in English than in other languages. The ability
to customize the dictionary will be added in a future release.
[float]
=== Pop-ups must be enabled in browsers


@@ -110,6 +110,18 @@ An analysis configuration object has the following properties:
consideration for defining categories. For example, you can exclude SQL
statements that appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters that are applied
prior to tokenization, it is easiest to specify them using this property.
If you also want to customize the tokenizer or post-tokenization filtering,
these filters must be included in the `categorization_analyzer` as
`pattern_replace` character filters. The effect is exactly the same.
//<<ml-configuring-categories>>.
`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified,
you can also define the analyzer that will be used to interpret the field
to be categorized. See <<ml-categorizationanalyzer,categorization analyzer>>.
//<<ml-configuring-categories>>.
`detectors`::
@@ -293,6 +305,102 @@ job creation fails.
--
[float]
[[ml-categorizationanalyzer]]
==== Categorization Analyzer
The categorization analyzer specifies how the `categorization_field` will be
interpreted by the categorization process. The syntax is very similar to that
used to define the `analyzer` in the {ref}/analyze.html[Analyze endpoint].
The `categorization_analyzer` field can be specified either as a string or as
an object.
If it is a string it must refer to a
{ref}/analysis-analyzers.html[built-in analyzer] or one added by
another plugin.
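For example, the string form might be as simple as the following sketch (whether
the built-in `whitespace` analyzer suits your messages is an assumption):
[source,js]
--------------------------------------------------
"categorization_analyzer" : "whitespace"
--------------------------------------------------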
If it is an object it has the following properties:
`char_filter`::
(array of strings or objects) One or more
{ref}/analysis-charfilters.html[character filters]. In addition
to the built-in character filters, other plugins may provide more. This property
is optional. If it is not specified, no character filters are applied. If
you are customizing some other aspect of the analyzer and need to achieve
the equivalent of `categorization_filters` (which are not permitted when any
other aspect of the analyzer is customized), add them here as
{ref}/analysis-pattern-replace-charfilter.html[pattern replace character filters].
`tokenizer`::
(string or object) The name or definition of the
{ref}/analysis-tokenizers.html[tokenizer] to use after character
filters have been applied. This property is compulsory if `categorization_analyzer`
is specified as an object. Machine learning provides a tokenizer called `ml_classic`
that tokenizes in the same way as the non-customizable tokenizer in older versions of
the product. If you want to keep this tokenization but change the character or
token filters, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
`filter`::
(array of strings or objects) One or more
{ref}/analysis-tokenfilters.html[token filters]. In addition to the built-in token
filters, other plugins may provide more. This property is optional. If it is not
specified, no token filters are applied.
If you omit `categorization_analyzer` entirely, the default is the analyzer from
the following job validation request:
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// CONSOLE
However, if you specify any part of `categorization_analyzer`, any omitted
sub-properties are _not_ defaulted.
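For example (a sketch illustrating this rule rather than a recommended
configuration), the following `categorization_analyzer` specifies only a
tokenizer, so no stop token filter is applied and the day/month words are no
longer removed:
[source,js]
--------------------------------------------------
"categorization_analyzer" : {
  "tokenizer" : "ml_classic"
}
--------------------------------------------------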
If you are categorizing non-English messages in a language where words are
separated by spaces, you may get better results if you change the day/month words
in the stop token filter to those from your language. If you are categorizing
messages in a language where words are not separated by spaces, you also need to
use a different tokenizer to get sensible categorization results.
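As a sketch, a `categorization_analyzer` for French logs might swap in French
day/month words (the word list shown is illustrative and abbreviated):
[source,js]
--------------------------------------------------
"categorization_analyzer" : {
  "tokenizer" : "ml_classic",
  "filter" : [
    { "type" : "stop", "stopwords": [
      "lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche",
      "janvier", "février", "mars", "avril", "mai", "juin",
      "juillet", "août", "septembre", "octobre", "novembre", "décembre"
    ] }
  ]
}
--------------------------------------------------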
It is important to be aware that analyzing machine generated log messages for
categorization is a little different from tokenizing for search. Features that
work well for search, such as stemming, synonym substitution and lowercasing, are
likely to make the results of categorization worse. However, for drilldown from
machine learning results to work correctly, the tokens that the categorization
analyzer produces must be sufficiently similar to those produced by the search
analyzer that searching for them finds the original document that the categorized
field came from.
For more information, see {xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
//<<ml-configuring-categories>>.
[float]
[[ml-apilimits]]
==== Analysis Limits