[DOCS] Add documentation for ML categorization_analyzer (elastic/x-pack-elasticsearch#3554)
This is the documentation for the changes made in elastic/x-pack-elasticsearch#3372. Relates elastic/machine-learning-cpp#491. Original commit: elastic/x-pack-elasticsearch@7d67e9d894
@@ -79,6 +79,129 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options related to categorization"]
NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.

It is possible to customize the way the categorization field values are
interpreted to an even greater extent:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. Use
it if you want the same categorization behavior as older versions.
<3> English day and month words are filtered from log messages by default
before categorization. If your logs are in a different language and contain
dates, you may get better results by filtering the day and month words in
your language.

The optional `categorization_analyzer` property allows even greater
customization of how categorization interprets the categorization field value.
It can refer to a built-in Elasticsearch analyzer or a combination of zero or
more character filters, a tokenizer, and zero or more token filters.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
Elasticsearch {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens and dots are removed from the beginning of tokens.
<5> Underscores, hyphens and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. (The difference in behavior is that this custom analyzer does not
include accented letters in tokens whereas the `ml_classic` tokenizer does,
although that could be fixed by using more complex regular expressions.)

NOTE: To add the `categorization_analyzer` property, you must use the
**Edit JSON** tab and copy the `categorization_analyzer` object from one of
the API examples above.

After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:
@@ -5,11 +5,15 @@ The following limitations and known problems apply to the {version} release of
{xpack}:

[float]
=== Categorization uses English dictionary words
//See x-pack-elasticsearch/#3021
Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages you must define a different
`categorization_analyzer` for your job. Additionally, a dictionary that is used
to influence the categorization process contains only English words. This means
categorization may work better in English than in other languages. The ability
to customize the dictionary will be added in a future release.

[float]
=== Pop-ups must be enabled in browsers
@@ -110,6 +110,18 @@ An analysis configuration object has the following properties:
consideration for defining categories. For example, you can exclude SQL
statements that appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters to be applied
prior to tokenization, it is easiest to specify them using this property.
If you also want to customize the tokenizer or post-tokenization filtering,
these filters must be included in the `categorization_analyzer` as
`pattern_replace` `char_filter`s. The effect is exactly the same, as the
short sketch below shows.
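+
For instance, in this sketch the pattern from the first example in
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages] is
specified using this property:
+
[source,js]
----------------------------------
"categorization_filters": [ "\\[statement:.*\\]" ]
----------------------------------
+
The same effect can be achieved with the pattern expressed as a
`pattern_replace` char filter (for brevity the default stop token filter is
omitted here; remember that omitted sub-properties of `categorization_analyzer`
are not defaulted):
+
[source,js]
----------------------------------
"categorization_analyzer": {
  "char_filter": [
    { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" }
  ],
  "tokenizer": "ml_classic"
}
----------------------------------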
//<<ml-configuring-categories>>.

`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified,
you can also define the analyzer that will be used to interpret the field
to be categorized. See <<ml-categorizationanalyzer,categorization analyzer>>.
//<<ml-configuring-categories>>.

`detectors`::
@@ -293,6 +305,102 @@ job creation fails.
--

[float]
[[ml-categorizationanalyzer]]
==== Categorization Analyzer

The categorization analyzer specifies how the `categorization_field` will be
interpreted by the categorization process. The syntax is very similar to that
used to define the `analyzer` in the {ref}/analyze.html[Analyze endpoint].
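
Because the syntax is shared, you can experiment with how a candidate analyzer
breaks up a sample message by calling the Analyze endpoint directly. The
following is just a sketch: the sample text is invented, and it assumes the
`ml_classic` tokenizer is available to the Analyze endpoint in your deployment:

[source,js]
----------------------------------
POST _analyze
{
  "tokenizer" : "ml_classic",
  "filter" : [
    { "type" : "stop", "stopwords": [ "GMT", "UTC" ] }
  ],
  "text" : "Sep 4 20:49:16 GMT esxi1.acme.com Vpxa: resource pool changed"
}
----------------------------------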

The `categorization_analyzer` field can be specified either as a string or as
an object.

If it is a string it must refer to a
{ref}/analysis-analyzers.html[built-in analyzer] or one added by
another plugin.
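
For example, a minimal sketch that refers to the built-in `whitespace` analyzer
by name (the job ID and field names here are purely illustrative):

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_whitespace_logs
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors" :[{
      "function": "count",
      "by_field_name": "mlcategory"
    }],
    "categorization_analyzer": "whitespace"
  },
  "data_description" : {
    "time_field": "time"
  }
}
----------------------------------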

If it is an object it has the following properties:

`char_filter`::
(array of strings or objects) One or more
{ref}/analysis-charfilters.html[character filters]. In addition
to the built-in character filters, other plugins may provide more. This
property is optional. If it is not specified, no character filters are applied.
If you are customizing some other aspect of the analyzer and need to achieve
the equivalent of `categorization_filters` (which are not permitted when some
other aspect of the analyzer is customized), add them here as
{ref}/analysis-pattern-replace-charfilter.html[pattern replace character filters].

`tokenizer`::
(string or object) The name or definition of the
{ref}/analysis-tokenizers.html[tokenizer] to use after character
filters have been applied. This property is compulsory if
`categorization_analyzer` is specified as an object. Machine learning provides
a tokenizer called `ml_classic` that tokenizes in the same way as the
non-customizable tokenizer in older versions of the product. If you want to
keep this tokenization but change the character or token filters, specify
`"tokenizer": "ml_classic"` in your `categorization_analyzer`.

`filter`::
(array of strings or objects) One or more
{ref}/analysis-tokenfilters.html[token filters]. In addition to the built-in
token filters, other plugins may provide more. This property is optional. If
it is not specified, no token filters are applied.

If you omit `categorization_analyzer` entirely then the default that is used
is the one in the following job:

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// CONSOLE

However, if you specify any part of `categorization_analyzer` then any omitted
sub-properties are _not_ defaulted.
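
For example, in the following sketch only the tokenizer is specified, so no
token filters are applied at all; in particular, the day and month stopwords
from the default analyzer above are _not_ filtered out:

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic"
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------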

If you are categorizing non-English messages in a language where words are
separated by spaces, you may get better results if you change the day and
month words in the stop token filter to those from your language. If you are
categorizing messages in a language where words are not separated by spaces,
you also need to use a different tokenizer in order to get sensible
categorization results.
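
For instance, for French logs you might replace the English stopwords with a
stop token filter along these lines (a sketch only; the word list is not
exhaustive, and abbreviated forms would need to be added too):

[source,js]
--------------------------------------------------
"filter": [
  { "type" : "stop", "stopwords": [
    "lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche",
    "janvier", "février", "mars", "avril", "mai", "juin",
    "juillet", "août", "septembre", "octobre", "novembre", "décembre"
  ] }
]
--------------------------------------------------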

It is important to be aware that analyzing for categorization of machine
generated log messages is a little different from tokenizing for search.
Features that work well for search, such as stemming, synonym substitution
and lowercasing, are likely to make the results of categorization worse.
However, in order for drilldown from machine learning results to work
correctly, the tokens that the categorization analyzer produces need to be
sufficiently similar to those produced by the search analyzer that if you
search for the tokens that the categorization analyzer produces, you will
find the original document that the categorized field came from.

For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
//<<ml-configuring-categories>>.

[float]
[[ml-apilimits]]
==== Analysis Limits