[DOCS] Add documentation for ML categorization_analyzer (elastic/x-pack-elasticsearch#3554)

This is the documentation for the changes made in elastic/x-pack-elasticsearch#3372.

Relates elastic/machine-learning-cpp#491

Original commit: elastic/x-pack-elasticsearch@7d67e9d894
David Roberts 2018-01-15 15:47:19 +00:00 committed by GitHub
parent d4cddc12d0
commit e9dafbd78d
3 changed files with 238 additions and 3 deletions


@@ -79,6 +79,129 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options relat
NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.
It is possible to customize the way the categorization field values are interpreted
to an even greater extent:
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same effect as the `categorization_filters` in the first
example (see the snippet after these callouts).
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. Use
it if you want the same categorization behavior as older versions.
<3> English day and month words are filtered from log messages by default
before categorization. If your logs are in a different language and contain
dates, you may get better results by filtering the day and month words in your
language instead.
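For reference, callout <1> corresponds to a `categorization_filters` setting like
the following sketch (only the relevant part of `analysis_config` is shown, and it
assumes the same `message` field as above):
[source,js]
----------------------------------
"analysis_config" : {
  "categorization_field_name": "message",
  "categorization_filters": [ "\\[statement:.*\\]" ]
}
----------------------------------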
The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in Elasticsearch analyzer, or a combination of zero or more character
filters, a tokenizer, and zero or more token filters.
The `ml_classic` tokenizer and the day/month stopword filter are more-or-less
equivalent to the following analyzer defined using only built-in Elasticsearch
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> Tokens consist of hyphens, digits, letters, underscores and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens and dots are removed from the beginning of tokens...
<5> ...and also from the end of tokens.
The key difference between the default `categorization_analyzer` and this example
analyzer is that using the `ml_classic` tokenizer is several times faster. (The
difference in behavior is that this custom analyzer does not include accented
letters in tokens whereas the `ml_classic` tokenizer will, although that could be
fixed by using more complex regular expressions.)
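For example, one way to include Latin-1 accented letters would be to widen the
character classes in the tokenizer pattern, as in the following sketch (the exact
ranges to include depend on your data, and the `pattern_replace` filters in the
example above would need the same change):
[source,js]
----------------------------------
"tokenizer": {
  "type" : "simple_pattern_split",
  "pattern" : "[^-0-9A-Za-zÀ-ÖØ-öø-ÿ_.]+"
}
----------------------------------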
NOTE: To add the `categorization_analyzer` property, you must use the **Edit JSON**
tab and copy the `categorization_analyzer` object from one of the API examples above.
After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:


@@ -5,11 +5,15 @@ The following limitations and known problems apply to the {version} release of
{xpack}:
[float]
=== Categorization uses English dictionary words
//See x-pack-elasticsearch/#3021
Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages you must define a different
`categorization_analyzer` for your job. Additionally, a dictionary used to
influence the categorization process contains only English words. This means
categorization may work better in English than in other languages. The ability
to customize the dictionary will be added in a future release.
[float]
=== Pop-ups must be enabled in browsers


@@ -110,6 +110,18 @@ An analysis configuration object has the following properties:
consideration for defining categories. For example, you can exclude SQL
statements that appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters that are applied
prior to tokenization, it is easiest to specify them using this property.
If you also want to customize the tokenizer or post-tokenization filtering,
these filters must be included in the `categorization_analyzer` as
`pattern_replace` character filters. The effect is exactly the same.
//<<ml-configuring-categories>>.
`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified,
you can also define the analyzer that will be used to interpret the field
to be categorized. See <<ml-categorizationanalyzer,categorization analyzer>>.
//<<ml-configuring-categories>>.
`detectors`::
@@ -293,6 +305,102 @@ job creation fails.
--
[float]
[[ml-categorizationanalyzer]]
==== Categorization Analyzer
The categorization analyzer specifies how the `categorization_field` will be
interpreted by the categorization process. The syntax is very similar to that
used to define the `analyzer` in the {ref}/analyze.html[Analyze endpoint].
The `categorization_analyzer` field can be specified either as a string or as
an object.
If it is a string it must refer to a
{ref}/analysis-analyzers.html[built-in analyzer] or one added by
another plugin.
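For example, the string form might be as simple as the following sketch (whether
the built-in `whitespace` analyzer suits your messages is an assumption):
[source,js]
--------------------------------------------------
"categorization_analyzer" : "whitespace"
--------------------------------------------------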
If it is an object it has the following properties:
`char_filter`::
(array of strings or objects) One or more
{ref}/analysis-charfilters.html[character filters]. In addition
to the built-in character filters, other plugins may provide more. This property
is optional. If it is not specified, no character filters are applied. If
you are customizing some other aspect of the analyzer and need to achieve
the equivalent of `categorization_filters` (which are not permitted when any
other aspect of the analyzer is customized), add them here as
{ref}/analysis-pattern-replace-charfilter.html[pattern replace character filters].
`tokenizer`::
(string or object) The name or definition of the
{ref}/analysis-tokenizers.html[tokenizer] to use after character
filters have been applied. This property is compulsory if `categorization_analyzer`
is specified as an object. Machine learning provides a tokenizer called `ml_classic`
that tokenizes in the same way as the non-customizable tokenizer in older versions of
the product. If you want to keep this tokenization but change the character or
token filters, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
`filter`::
(array of strings or objects) One or more
{ref}/analysis-tokenfilters.html[token filters]. In addition to the built-in token
filters, other plugins may provide more. This property is optional. If it is not
specified, no token filters are applied.
If you omit `categorization_analyzer` entirely, the default is the analyzer from
the following job validation request:
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// CONSOLE
However, if you specify any part of `categorization_analyzer`, any omitted
sub-properties are _not_ defaulted.
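For example (a sketch illustrating this rule rather than a recommended
configuration), the following `categorization_analyzer` specifies only a
tokenizer, so no stop token filter is applied and the day/month words are no
longer removed:
[source,js]
--------------------------------------------------
"categorization_analyzer" : {
  "tokenizer" : "ml_classic"
}
--------------------------------------------------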
If you are categorizing non-English messages in a language where words are
separated by spaces, you may get better results if you change the day/month words
in the stop token filter to those from your language. If you are categorizing
messages in a language where words are not separated by spaces, you also need to
use a different tokenizer to get sensible categorization results.
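As a sketch, a `categorization_analyzer` for French logs might swap in French
day/month words (the word list shown is illustrative and abbreviated):
[source,js]
--------------------------------------------------
"categorization_analyzer" : {
  "tokenizer" : "ml_classic",
  "filter" : [
    { "type" : "stop", "stopwords": [
      "lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche",
      "janvier", "février", "mars", "avril", "mai", "juin",
      "juillet", "août", "septembre", "octobre", "novembre", "décembre"
    ] }
  ]
}
--------------------------------------------------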
It is important to be aware that analyzing machine generated log messages for
categorization is a little different from tokenizing for search. Features that
work well for search, such as stemming, synonym substitution and lowercasing, are
likely to make the results of categorization worse. However, for drilldown from
machine learning results to work correctly, the tokens that the categorization
analyzer produces must be sufficiently similar to those produced by the search
analyzer that searching for them finds the original document that the categorized
field came from.
For more information, see {xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
//<<ml-configuring-categories>>.
[float]
[[ml-apilimits]]
==== Analysis Limits