[DOCS] Add documentation for ML categorization_analyzer (elastic/x-pack-elasticsearch#3554)

This is the documentation for the changes made in elastic/x-pack-elasticsearch#3372.

Relates elastic/machine-learning-cpp#491

Original commit: elastic/x-pack-elasticsearch@7d67e9d894
David Roberts 2018-01-15 15:47:19 +00:00 committed by GitHub
parent d4cddc12d0
commit e9dafbd78d
3 changed files with 238 additions and 3 deletions


@@ -79,6 +79,129 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options relat
NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.
It is possible to customize the way the categorization field values are interpreted
to an even greater extent:
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
{
"description" : "IT Ops Application Logs",
"analysis_config" : {
"categorization_field_name": "message",
"bucket_span":"30m",
"detectors" :[{
"function":"count",
"by_field_name": "mlcategory",
"detector_description": "Unusual message counts"
}],
"categorization_analyzer":{
"char_filter": [
{ "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
],
"tokenizer": "ml_classic", <2>
"filter": [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] } <3>
]
}
},
"analysis_limits":{
"categorization_examples_limit": 5
},
"data_description" : {
"time_field":"time",
"time_format": "epoch_ms"
}
}
----------------------------------
//CONSOLE
<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same as the `categorization_filters` in the first
example (a sketch of that form follows these callouts).
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. Use
it if you want the same categorization behavior as older versions.
<3> English day/month words are filtered by default from log messages
before categorization. If your logs are in a different language and contain
dates, then you may get better results by filtering day/month words in your
language.
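For comparison, a minimal sketch of the `categorization_filters` form that callout <1>
refers to might look like the following (the job name `it_ops_new_logs` and the exact
filter are assumptions based on the examples in this section):
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_filters": [ "\\[statement:.*\\]" ] <1>
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> This regular expression is applied to the categorization field value before
tokenization, which is exactly what the `pattern_replace` character filter above does.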
The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in Elasticsearch analyzer, or a combination of zero or more character
filters, a tokenizer, and zero or more token filters.
The `ml_classic` tokenizer and the day/month stopword filter are more-or-less
equivalent to the following analyzer defined using only built-in Elasticsearch
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
{
"description" : "IT Ops Application Logs",
"analysis_config" : {
"categorization_field_name": "message",
"bucket_span":"30m",
"detectors" :[{
"function":"count",
"by_field_name": "mlcategory",
"detector_description": "Unusual message counts"
}],
"categorization_analyzer":{
"tokenizer": {
"type" : "simple_pattern_split",
"pattern" : "[^-0-9A-Za-z_.]+" <1>
},
"filter": [
{ "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
{ "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
{ "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
{ "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
{ "type" : "stop", "stopwords": [
"",
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] }
]
}
},
"analysis_limits":{
"categorization_examples_limit": 5
},
"data_description" : {
"time_field":"time",
"time_format": "epoch_ms"
}
}
----------------------------------
//CONSOLE
<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default categorization ignores tokens that begin with a digit.
<3> By default categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens and dots are removed from the beginning of tokens.
<5> Underscores, hyphens and dots are also removed from the end of tokens.
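If you want to preview how an analyzer such as this splits a sample message into
tokens before creating a job, one option is to pass the same tokenizer and token
filters to the {ref}/analyze.html[Analyze endpoint]. The following is only a sketch:
the sample message is an assumption and the day/month stop word filter is omitted
for brevity.
[source,js]
----------------------------------
POST _analyze
{
  "tokenizer": {
    "type" : "simple_pattern_split",
    "pattern" : "[^-0-9A-Za-z_.]+"
  },
  "filter": [
    { "type" : "pattern_replace", "pattern": "^[0-9].*" },
    { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" },
    { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" },
    { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }
  ],
  "text": "[2018-01-15 10:32:01] INFO Connection to db-server-01 failed"
}
----------------------------------
//CONSOLE
The response shows the resulting tokens; in the full analyzer, the `stop` filter with
the empty-string stop word then removes any tokens that the filters have emptied out.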
The key difference between the default `categorization_analyzer` and this example
analyzer is that using the `ml_classic` tokenizer is several times faster. (The
difference in behavior is that this custom analyzer does not include accented
letters in tokens whereas the `ml_classic` tokenizer will, although that could be
fixed by using more complex regular expressions.)
NOTE: To add the `categorization_analyzer` property, you must use the **Edit JSON**
tab and copy the `categorization_analyzer` object from one of the API examples above.
After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:


@@ -5,11 +5,15 @@ The following limitations and known problems apply to the {version} release of
{xpack}:
[float]
=== Categorization uses English dictionary words
//See x-pack-elasticsearch/#3021
Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages you must define a different
categorization_analyzer for your job. Additionally, a dictionary used to influence
the categorization process contains only English words. This means categorization
may work better in English than in other languages. The ability to customize the
dictionary will be added in a future release.
[float]
=== Pop-ups must be enabled in browsers


@@ -110,6 +110,18 @@ An analysis configuration object has the following properties:
consideration for defining categories. For example, you can exclude SQL
statements that appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters to be applied prior
to tokenization, then it is easiest to specify them using this property.
If you also want to customize the tokenizer or post-tokenization filtering,
then these filters must be included in the `categorization_analyzer` as
`pattern_replace` character filters. The effect is exactly the same.
//<<ml-configuring-categories>>.
`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified,
you can also define the analyzer that will be used to interpret the field
to be categorized. See <<ml-categorizationanalyzer,categorization analyzer>>.
//<<ml-configuring-categories>>.
`detectors`::
@@ -293,6 +305,102 @@ job creation fails.
--
[float]
[[ml-categorizationanalyzer]]
==== Categorization Analyzer
The categorization analyzer specifies how the `categorization_field` will be
interpreted by the categorization process. The syntax is very similar to that
used to define the `analyzer` in the {ref}/analyze.html[Analyze endpoint].
The `categorization_analyzer` field can be specified either as a string or as
an object.
If it is a string it must refer to a
{ref}/analysis-analyzers.html[built-in analyzer] or one added by
another plugin.
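For example, a minimal sketch that refers to a built-in analyzer by name is shown
below. The `standard` analyzer is used purely for illustration here; as noted later
in this section, analyzers designed for search are not necessarily good choices for
categorization.
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_analyzer" : "standard",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// CONSOLE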
If it is an object it has the following properties:
`char_filter`::
(array of strings or objects) One or more
{ref}/analysis-charfilters.html[character filters]. In addition
to the built-in character filters, other plugins may provide more. This property
is optional. If it is not specified, there will be no character filters. If
you are customizing some other aspect of the analyzer and need to achieve
the equivalent of `categorization_filters` (which are not permitted when some
other aspect of the analyzer is customized), add them here as
{ref}/analysis-pattern-replace-charfilter.html[pattern replace character filters].
`tokenizer`::
(string or object) The name or definition of the
{ref}/analysis-tokenizers.html[tokenizer] to use after character
filters have been applied. This property is compulsory if `categorization_analyzer`
is specified as an object. Machine learning provides a tokenizer called `ml_classic`
that tokenizes in the same way as the non-customizable tokenizer in older versions of
the product. If you would like to stick with this but change the character or token
filters, then specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
`filter`::
(array of strings or objects) One or more
{ref}/analysis-tokenfilters.html[token filters]. In addition to the built-in token
filters, other plugins may provide more. This property is optional. If it is not
specified, there will be no token filters.
If you omit `categorization_analyzer` entirely, the default that is used is the
one from the following job:
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
"analysis_config" : {
"categorization_analyzer" : {
"tokenizer" : "ml_classic",
"filter" : [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] }
]
},
"categorization_field_name": "message",
"detectors" :[{
"function":"count",
"by_field_name": "mlcategory"
}]
},
"data_description" : {
}
}
--------------------------------------------------
// CONSOLE
However, if you specify any part of `categorization_analyzer`, then any omitted
sub-properties are _not_ defaulted.
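For example, the following sketch specifies only a tokenizer, so no stop token
filter is applied at all, even though the default `categorization_analyzer` shown
above includes one:
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic" <1>
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// CONSOLE
<1> Because part of `categorization_analyzer` is specified, the day/month stop word
filter from the default is _not_ added automatically.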
If you are categorizing non-English messages in a language where words are separated
by spaces, you may get better results if you change the day/month words in the stop
token filter to those from your language. If you are categorizing messages in a
language where words are not separated by spaces, then you will also need to use a
different tokenizer in order to get sensible categorization results.
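For example, a sketch of the default analyzer adapted for French logs might replace
the English day and month words with their French equivalents (the word list below
is purely illustrative):
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche",
          "janvier", "février", "mars", "avril", "mai", "juin",
          "juillet", "août", "septembre", "octobre", "novembre", "décembre",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// CONSOLE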
It is important to be aware that analyzing for categorization of machine generated
log messages is a little different to tokenizing for search. Features that work well
for search, such as stemming, synonym substitution and lowercasing, are likely to make
the results of categorization worse. However, in order for drilldown from machine
learning results to work correctly, the tokens that the categorization analyzer
produces must be sufficiently similar to those produced by the search analyzer that
searching for them finds the original document that the categorized field came from.
For more information, see {xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
//<<ml-configuring-categories>>.
[float]
[[ml-apilimits]]
==== Analysis Limits