diff --git a/docs/en/ml/categories.asciidoc b/docs/en/ml/categories.asciidoc index d2c1ac2503f..8a7114e418e 100644 --- a/docs/en/ml/categories.asciidoc +++ b/docs/en/ml/categories.asciidoc @@ -79,6 +79,129 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options relat NOTE: To add the `categorization_examples_limit` property, you must use the **Edit JSON** tab and copy the `analysis_limits` object from the API example. +It is possible to customize the way the categorization field values are interpreted +to an even greater extent: + +[source,js] +---------------------------------- +PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2 +{ + "description" : "IT Ops Application Logs", + "analysis_config" : { + "categorization_field_name": "message", + "bucket_span":"30m", + "detectors" :[{ + "function":"count", + "by_field_name": "mlcategory", + "detector_description": "Unusual message counts" + }], + "categorization_analyzer":{ + "char_filter": [ + { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1> + ], + "tokenizer": "ml_classic", <2> + "filter": [ + { "type" : "stop", "stopwords": [ + "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", + "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", + "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", + "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", + "GMT", "UTC" + ] } <3> + ] + } + }, + "analysis_limits":{ + "categorization_examples_limit": 5 + }, + "data_description" : { + "time_field":"time", + "time_format": "epoch_ms" + } +} +---------------------------------- +//CONSOLE +<1> The +{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter] +here achieves exactly the same as the `categorization_filters` in the first +example. 
+<2> The `ml_classic` tokenizer works like the non-customizable tokenization +that was used for categorization in older versions of machine learning. Use +it if you want the same categorization behavior as older versions. +<3> English day/month words are filtered by default from log messages +before categorization. If your logs are in a different language and contain +dates then you may get better results by filtering day/month words in your +language. + +The optional `categorization_analyzer` property allows even greater customization +of how categorization interprets the categorization field value. It can refer to +a built-in Elasticsearch analyzer, or a combination of zero or more character +filters, a tokenizer, and zero or more token filters. + +The `ml_classic` tokenizer and the day/month stopword filter are more-or-less +equivalent to the following analyzer defined using only built-in Elasticsearch +{ref}/analysis-tokenizers.html[tokenizers] and +{ref}/analysis-tokenfilters.html[token filters]: + +[source,js] +---------------------------------- +PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3 +{ + "description" : "IT Ops Application Logs", + "analysis_config" : { + "categorization_field_name": "message", + "bucket_span":"30m", + "detectors" :[{ + "function":"count", + "by_field_name": "mlcategory", + "detector_description": "Unusual message counts" + }], + "categorization_analyzer":{ + "tokenizer": { + "type" : "simple_pattern_split", + "pattern" : "[^-0-9A-Za-z_.]+" <1> + }, + "filter": [ + { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2> + { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3> + { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4> + { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5> + { "type" : "stop", "stopwords": [ + "", + "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", + "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", + "January", "February", "March", 
"April", "May", "June", "July", "August", "September", "October", "November", "December", + "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", + "GMT", "UTC" + ] } + ] + } + }, + "analysis_limits":{ + "categorization_examples_limit": 5 + }, + "data_description" : { + "time_field":"time", + "time_format": "epoch_ms" + } +} +---------------------------------- +//CONSOLE +<1> Tokens basically consist of hyphens, digits, letters, underscores and dots. +<2> By default categorization ignores tokens that begin with a digit. +<3> By default categorization also ignores tokens that are hexadecimal numbers. +<4> Underscores, hypens and dots are removed from the beginning of tokens. +<5> Also at the end of tokens. + +The key difference between the default `categorization_analyzer` and this example +analyzer is that using the `ml_classic` tokenizer is several times faster. (The +difference in behavior is that this custom analyzer does not include accented +letters in tokens whereas the `ml_classic` tokenizer will, although that could be +fixed by using more complex regular expressions.) + +NOTE: To add the `categorization_analyzer` property, you must use the **Edit JSON** +tab and copy the `categorization_analyzer` object from one of the API examples above. + After you open the job and start the {dfeed} or supply data to the job, you can view the results in {kib}. For example: diff --git a/docs/en/ml/limitations.asciidoc b/docs/en/ml/limitations.asciidoc index 1d8eda58bdf..242d4be4431 100644 --- a/docs/en/ml/limitations.asciidoc +++ b/docs/en/ml/limitations.asciidoc @@ -5,11 +5,15 @@ The following limitations and known problems apply to the {version} release of {xpack}: [float] -=== Categorization uses English tokenization rules and dictionary words +=== Categorization uses English dictionary words //See x-pack-elasticsearch/#3021 Categorization identifies static parts of unstructured logs and groups similar -messages together. 
This is currently supported only for English language log -messages. +messages together. The default categorization tokenizer assumes English language +log messages. For other languages you must define a different +`categorization_analyzer` for your job. Additionally, a dictionary used to influence +the categorization process contains only English words. This means categorization +may work better in English than in other languages. The ability to customize the +dictionary will be added in a future release. [float] === Pop-ups must be enabled in browsers diff --git a/docs/en/rest-api/ml/jobresource.asciidoc b/docs/en/rest-api/ml/jobresource.asciidoc index 7f20830ea23..2f3026fff08 100644 --- a/docs/en/rest-api/ml/jobresource.asciidoc +++ b/docs/en/rest-api/ml/jobresource.asciidoc @@ -110,6 +110,18 @@ An analysis configuration object has the following properties: consideration for defining categories. For example, you can exclude SQL statements that appear in your log files. For more information, see {xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages]. + This property cannot be used at the same time as `categorization_analyzer`. + If you only want to define simple regular expression filters to be applied + prior to tokenization then it is easiest to specify them using this property. + If you also want to customize the tokenizer or post-tokenization filtering + then these filters must be included in the `categorization_analyzer` as + `pattern_replace` `char_filter`s. The effect is exactly the same. +//<>. + +`categorization_analyzer`:: + (object or string) If `categorization_field_name` is specified, + you can also define the analyzer that will be used to interpret the field + to be categorized. See <<ml-categorizationanalyzer>>. //<>. `detectors`:: @@ -293,6 +305,102 @@ job creation fails. 
-- +[float] +[[ml-categorizationanalyzer]] +==== Categorization Analyzer + +The categorization analyzer specifies how the field named by `categorization_field_name` will be +interpreted by the categorization process. The syntax is very similar to that +used to define the `analyzer` in the {ref}/analyze.html[Analyze endpoint]. + +The `categorization_analyzer` field can be specified either as a string or as +an object. + +If it is a string, it must refer to a +{ref}/analysis-analyzers.html[built-in analyzer] or one added by +another plugin. + +If it is an object, it has the following properties: + +`char_filter`:: + (array of strings or objects) One or more + {ref}/analysis-charfilters.html[character filters]. In addition + to the built-in character filters, other plugins may provide more. This property + is optional. If not specified then there will be no character filters. If + you are customizing some other aspect of the analyzer and need to achieve + the equivalent of `categorization_filters` (which are not permitted when some + other aspect of the analyzer is customized), add them here as + {ref}/analysis-pattern-replace-charfilter.html[pattern replace character filters]. + +`tokenizer`:: + (string or object) The name or definition of the + {ref}/analysis-tokenizers.html[tokenizer] to use after character + filters have been applied. This property is required if `categorization_analyzer` + is specified as an object. Machine learning provides a tokenizer called `ml_classic` + that tokenizes in the same way as the non-customizable tokenizer in older versions of + the product. If you want to keep this tokenizer but change the character or token + filters, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`. + +`filter`:: + (array of strings or objects) One or more + {ref}/analysis-tokenfilters.html[token filters]. In addition to the built-in token + filters, other plugins may provide more. This property is optional. 
If not specified + then there will be no token filters. + +If you omit `categorization_analyzer` entirely then the default that will be used is +the one from the following job: + +[source,js] +-------------------------------------------------- +POST _xpack/ml/anomaly_detectors/_validate +{ + "analysis_config" : { + "categorization_analyzer" : { + "tokenizer" : "ml_classic", + "filter" : [ + { "type" : "stop", "stopwords": [ + "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", + "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", + "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", + "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", + "GMT", "UTC" + ] } + ] + }, + "categorization_field_name": "message", + "detectors" :[{ + "function":"count", + "by_field_name": "mlcategory" + }] + }, + "data_description" : { + } +} +-------------------------------------------------- +// CONSOLE + +However, if you specify any part of `categorization_analyzer` then any omitted +sub-properties are _not_ defaulted. + +If you are categorizing non-English messages in a language where words are separated +by spaces you may get better results if you change the day/month words in the stop +token filter to those from your language. If you are categorizing messages in a language +where words are not separated by spaces then you will need to use a different tokenizer +as well in order to get sensible categorization results. + +It is important to be aware that analyzing for categorization of machine generated +log messages is a little different to tokenizing for search. Features that work well +for search, such as stemming, synonym substitution and lowercasing are likely to make +the results of categorization worse. 
However, in order for drilldown from machine +learning results to work correctly, the tokens that the categorization analyzer +produces need to be sufficiently similar to those produced by the search analyzer +that if you search for the tokens that the categorization analyzer produces you will +find the original document that the field to be categorized came from. + +For more information, see {xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages]. +//<>. + + [float] [[ml-apilimits]] ==== Analysis Limits