[DOCS] Add documentation for ML categorization_analyzer (elastic/x-pack-elasticsearch#3554)
This is the documentation for the changes made in elastic/x-pack-elasticsearch#3372.
Relates elastic/machine-learning-cpp#491
Original commit: elastic/x-pack-elasticsearch@7d67e9d894
@ -79,6 +79,129 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options relat

NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.

It is possible to customize the way the categorization field values are interpreted
to an even greater extent:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. Use
it if you want the same categorization behavior as older versions.
<3> English day/month words are filtered by default from log messages
before categorization. If your logs are in a different language and contain
dates, you may get better results by filtering the day/month words in your
language.
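
For comparison, the `categorization_filters` equivalent mentioned in callout <1> could
be written along these lines (a minimal sketch that uses the `_validate` endpoint, with
only the fields needed to illustrate the point):

[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[statement:.*\\]" ], <1>
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : { }
}
----------------------------------
<1> The same regular expression as the `pattern_replace` character filter above; remember
that `categorization_filters` cannot be used at the same time as `categorization_analyzer`.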

The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in Elasticsearch analyzer, or a combination of zero or more character
filters, a tokenizer, and zero or more token filters.
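
As a minimal sketch of the string form (the built-in `whitespace` analyzer is named here
purely for illustration; it is not a recommendation for categorization quality):

[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_analyzer": "whitespace", <1>
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : { }
}
----------------------------------
<1> A string value must name a built-in analyzer or one added by another plugin.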

The `ml_classic` tokenizer and the day/month stopword filter are more-or-less
equivalent to the following analyzer defined using only built-in Elasticsearch
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default categorization ignores tokens that begin with a digit.
<3> By default categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens and dots are removed from the beginning of tokens.
<5> Underscores, hyphens and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this example
analyzer is that using the `ml_classic` tokenizer is several times faster. (The
difference in behavior is that this custom analyzer does not include accented
letters in tokens whereas the `ml_classic` tokenizer will, although that could be
fixed by using more complex regular expressions.)

NOTE: To add the `categorization_analyzer` property, you must use the **Edit JSON**
tab and copy the `categorization_analyzer` object from one of the API examples above.


After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:

@ -5,11 +5,15 @@ The following limitations and known problems apply to the {version} release of

{xpack}:

[float]
=== Categorization uses English dictionary words
//See x-pack-elasticsearch/#3021
Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages you must define a different
`categorization_analyzer` for your job. Additionally, a dictionary used to influence
the categorization process contains only English words. This means categorization
may work better in English than in other languages. The ability to customize the
dictionary will be added in a future release.

[float]
=== Pop-ups must be enabled in browsers

@ -110,6 +110,18 @@ An analysis configuration object has the following properties:

consideration for defining categories. For example, you can exclude SQL
statements that appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters to be applied
prior to tokenization, it is easiest to specify them using this property.
If you also want to customize the tokenizer or post-tokenization filtering,
these filters must be included in the `categorization_analyzer` as
`pattern_replace` `char_filter`s. The effect is exactly the same.
//<<ml-configuring-categories>>.

`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified,
you can also define the analyzer that will be used to interpret the field
to be categorized. See <<ml-categorizationanalyzer,categorization analyzer>>.
//<<ml-configuring-categories>>.

`detectors`::

@ -293,6 +305,102 @@ job creation fails.

--

[float]
[[ml-categorizationanalyzer]]
==== Categorization Analyzer

The categorization analyzer specifies how the `categorization_field` will be
interpreted by the categorization process. The syntax is very similar to that
used to define the `analyzer` in the {ref}/analyze.html[Analyze endpoint].
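
Because the syntax is so similar, you can experiment with candidate analyzer components
using the {ref}/analyze.html[Analyze endpoint] before putting them in a job. A minimal
sketch (the sample message is invented, and the built-in `whitespace` tokenizer is used
here as a simple stand-in):

[source,js]
----------------------------------
POST _analyze
{
  "char_filter": [
    { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" }
  ],
  "tokenizer": "whitespace",
  "filter": [
    { "type": "stop", "stopwords": [ "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun" ] }
  ],
  "text": "Tue [statement: SELECT * FROM users] service starting"
}
----------------------------------

The response lists the tokens that the definition produces, which is a quick way to check
that it behaves as intended before you use it in `categorization_analyzer`.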

The `categorization_analyzer` field can be specified either as a string or as
an object.

If it is a string it must refer to a
{ref}/analysis-analyzers.html[built-in analyzer] or one added by
another plugin.

If it is an object it has the following properties:

`char_filter`::
(array of strings or objects) One or more
{ref}/analysis-charfilters.html[character filters]. In addition
to the built-in character filters, other plugins may provide more. This property
is optional. If it is not specified, no character filters are applied. If
you are customizing some other aspect of the analyzer and need to achieve
the equivalent of `categorization_filters` (which are not permitted when some
other aspect of the analyzer is customized), add them here as
{ref}/analysis-pattern-replace-charfilter.html[pattern replace character filters].

`tokenizer`::
(string or object) The name or definition of the
{ref}/analysis-tokenizers.html[tokenizer] to use after character
filters have been applied. This property is compulsory if `categorization_analyzer`
is specified as an object. Machine learning provides a tokenizer called `ml_classic`
that tokenizes in the same way as the non-customizable tokenizer in older versions of
the product. If you want to keep this tokenizer but change the character or token
filters, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.

`filter`::
(array of strings or objects) One or more
{ref}/analysis-tokenfilters.html[token filters]. In addition to the built-in token
filters, other plugins may provide more. This property is optional. If it is not
specified, no token filters are applied.

If you omit `categorization_analyzer` entirely, the default is the analyzer
defined in the following job validation request:

[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// CONSOLE

However, if you specify any part of `categorization_analyzer`, any omitted
sub-properties are _not_ set to default values.
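
For example, the following sketch keeps the `ml_classic` tokenizer but, because the
`filter` array is omitted, the day/month stop words shown above are no longer removed:

[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic" <1>
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : { }
}
----------------------------------
<1> No `char_filter` or `filter` sub-properties are specified, so no character or token
filtering is applied at all.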

If you are categorizing non-English messages in a language where words are separated
by spaces, you may get better results if you change the day/month words in the stop
token filter to those from your language. If you are categorizing messages in a language
where words are not separated by spaces, you will also need to use a different tokenizer
in order to get sensible categorization results.
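
As an illustration of the first point, the stop words could be swapped for day and month
names in the target language (a sketch only; the French list below is an example of the
kind of substitution meant here, not an authoritative word list):

[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche",
          "janvier", "février", "mars", "avril", "mai", "juin",
          "juillet", "août", "septembre", "octobre", "novembre", "décembre",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : { }
}
----------------------------------

The `stop` token filter is case sensitive by default, so include capitalized variants of
these words if your logs contain them.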

Analyzing for categorization of machine-generated log messages is a little different
from tokenizing for search. Features that work well for search, such as stemming,
synonym substitution, and lowercasing, are likely to make the results of categorization
worse. However, for drilldown from machine learning results to work correctly, the
tokens that the categorization analyzer produces must be sufficiently similar to those
produced by the search analyzer: searching for the tokens that the categorization
analyzer produces should find the original document that the categorized field value
came from.

For more information, see {xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
//<<ml-configuring-categories>>.

[float]
[[ml-apilimits]]
==== Analysis Limits