[DOCS] Edited documentation for ML categorization_analyzer (elastic/x-pack-elasticsearch#3587)

Original commit: elastic/x-pack-elasticsearch@6dd179107a
Lisa Cawley 2018-01-17 13:11:36 -08:00 committed by GitHub
parent 60d4b7e53e
commit 9f6064f9ac
3 changed files with 98 additions and 82 deletions


@@ -13,10 +13,6 @@ example:
You can use {ml} to observe the static parts of the message, cluster similar
messages together, and classify them into message categories.
NOTE: Categorization uses English tokenization rules and dictionary words in
order to identify log message categories. As such, only English language log
messages are supported.
The {ml} model learns what volume and pattern is normal for each category over
time. You can then detect anomalies and surface rare events or unusual types of
messages by using count or rare functions. For example:
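A minimal sketch of such a job (the job name, `bucket_span`, and field names
here are illustrative) pairs the `count` function with the built-in
`mlcategory` field:
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/example_log_categories <1>
{
  "analysis_config" : {
    "bucket_span" : "30m",
    "categorization_field_name" : "message",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory",
      "detector_description" : "Unusual message counts"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
----------------------------------
//CONSOLE
<1> The job name and field names in this sketch are placeholders.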
@@ -79,8 +75,17 @@ image::images/ml-category-advanced.jpg["Advanced job configuration options relat
NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.
It is possible to customize the way the categorization field values are interpreted
to an even greater extent:
[float]
[[ml-configuring-analyzer]]
==== Customizing the Categorization Analyzer
Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.
You can, however, change the tokenization rules by customizing the way the
categorization field values are interpreted. For example:
[source,js]
----------------------------------
@@ -126,20 +131,20 @@ PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. Use
it if you want the same categorization behavior as older versions.
<3> English day/month words are filtered by default from log messages
before categorization. If your logs are in a different language and contain
dates then you may get better results by filtering day/month words in your
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property value.
<3> By default, English day or month words are filtered from log messages before
categorization. If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your
language.
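For example, for French-language logs, a sketch (the stopword list and job
details are illustrative) swaps in French day and month names and can be
checked with the `_validate` endpoint:
[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic",
      "filter" : [
        { "type" : "stop", "stopwords" : [
          "lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche",
          "janvier", "février", "mars", "avril", "mai", "juin",
          "juillet", "août", "septembre", "octobre", "novembre", "décembre"
        ] }
      ]
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
----------------------------------
//CONSOLE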
The optional `categorization_analyzer` property allows even greater customization
of how categorization interprets the categorization field value. It can refer to
a built-in Elasticsearch analyzer, or a combination of zero or more character
filters, a tokenizer, and zero or more token filters.
a built-in {es} analyzer or a combination of zero or more character filters,
a tokenizer, and zero or more token filters.
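For instance, a sketch (the regular expression and field names are
illustrative) that strips bracketed SQL statements with a `pattern_replace`
character filter before tokenizing with `ml_classic`:
[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "char_filter" : [
        { "type" : "pattern_replace", "pattern" : "\\[statement:.*\\]" } <1>
      ],
      "tokenizer" : "ml_classic"
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
----------------------------------
//CONSOLE
<1> An illustrative pattern; anything matching it is removed from the field
value before tokenization.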
The `ml_classic` tokenizer and the day/month stopword filter are more-or-less
equivalent to the following analyzer defined using only built-in Elasticsearch
The `ml_classic` tokenizer and the day and month stopword filter are more or less
equivalent to the following analyzer, which is defined using only built-in {es}
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:
@@ -188,23 +193,30 @@ PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
----------------------------------
//CONSOLE
<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default categorization ignores tokens that begin with a digit.
<3> By default categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hypens and dots are removed from the beginning of tokens.
<5> Also at the end of tokens.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
The key difference between the default `categorization_analyzer` and this example
analyzer is that using the `ml_classic` tokenizer is several times faster. (The
analyzer is that using the `ml_classic` tokenizer is several times faster. The
difference in behavior is that this custom analyzer does not include accented
letters in tokens whereas the `ml_classic` tokenizer will, although that could be
fixed by using more complex regular expressions.)
letters in tokens whereas the `ml_classic` tokenizer does, although that could
be fixed by using more complex regular expressions.
NOTE: To add the `categorization_analyzer` property, you must use the **Edit JSON**
tab and copy the `categorization_analyzer` object from one of the API examples above.
For more information about the `categorization_analyzer` property, see
{ref}/ml-job-resource.html#ml-categorizationanalyzer[Categorization Analyzer].
NOTE: To add the `categorization_analyzer` property in {kib}, you must use the
**Edit JSON** tab and copy the `categorization_analyzer` object from one of the
API examples above.
[float]
[[ml-viewing-categories]]
==== Viewing Categorization Results
After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:
view the categorization results in {kib}. For example:
[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization example in the Anomaly Explorer"]


@@ -10,10 +10,13 @@ The following limitations and known problems apply to the {version} release of
Categorization identifies static parts of unstructured logs and groups similar
messages together. The default categorization tokenizer assumes English language
log messages. For other languages you must define a different
categorization_analyzer for your job. Additionally, a dictionary used to influence
the categorization process contains only English words. This means categorization
may work better in English than in other languages. The ability to customize the
dictionary will be added in a future release.
`categorization_analyzer` for your job. For more information, see
<<ml-configuring-categories>>.
Additionally, a dictionary used to influence the categorization process contains
only English words. This means categorization might work better in English than
in other languages. The ability to customize the dictionary will be added in a
future release.
[float]
=== Pop-ups must be enabled in browsers


@@ -105,24 +105,23 @@ An analysis configuration object has the following properties:
(array of strings) If `categorization_field_name` is specified,
you can also define optional filters. This property expects an array of
regular expressions. The expressions are used to filter out matching sequences
off the categorization field values. This functionality is useful to fine tune
categorization by excluding sequences that should not be taken into
consideration for defining categories. For example, you can exclude SQL
statements that appear in your log files. For more information, see
from the categorization field values. You can use this functionality to fine
tune the categorization by excluding sequences from consideration when
categories are defined. For example, you can exclude SQL statements that
appear in your log files. For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
This property cannot be used at the same time as `categorization_analyzer`.
If you only want to define simple regular expression filters to be applied
prior to tokenization then it is easiest to specify them using this property.
If you also want to customize the tokenizer or post-tokenization filtering
then these filters must be included in the `categorization_analyzer` as
`pattern_replace` `char_filter`s. The effect is exactly the same.
//<<ml-configuring-categories>>.
If you only want to define simple regular expression filters that are applied
prior to tokenization, setting this property is the easiest method.
If you also want to customize the tokenizer or post-tokenization filtering,
use the `categorization_analyzer` property instead and include the filters as
`pattern_replace` character filters. The effect is exactly the same.
`categorization_analyzer`::
(object or string) If `categorization_field_name` is specified,
you can also define the analyzer that will be used to interpret the field
to be categorized. See <<ml-categorizationanalyzer,categorization analyzer>>.
//<<ml-configuring-categories>>.
(object or string) If `categorization_field_name` is specified, you can also
define the analyzer that is used to interpret the categorization field. This
property cannot be used at the same time as `categorization_filters`. See
<<ml-categorizationanalyzer,categorization analyzer>>.
`detectors`::
(array) An array of detector configuration objects,
@@ -316,39 +315,40 @@ used to define the `analyzer` in the <<indices-analyze,Analyze endpoint>>.
The `categorization_analyzer` field can be specified either as a string or as
an object.
If it is a string it must refer to a
{ref}/analysis-analyzers.html[built-in analyzer] or one added by
another plugin.
If it is a string it must refer to a <<analysis-analyzers,built-in analyzer>> or
one added by another plugin.
If it is an object it has the following properties:
`char_filter`::
(array of strings or objects) One or more
{ref}/analysis-charfilters.html[character filters]. In addition
to the built-in character filters other plugins may provide more. This property
is optional. If not specified then there will be no character filters. If
you are customizing some other aspect of the analyzer and need to achieve
the equivalent of `categorization_filters` (which are not permitted when some
other aspect of the analyzer is customized), add them here as
{ref}/analysis-pattern-replace-charfilter.html[pattern replace character filters].
<<analysis-charfilters,character filters>>. In addition to the built-in
character filters, other plugins can provide more character filters. This
property is optional. If it is not specified, no character filters are applied
prior to categorization. If you are customizing some other aspect of the
analyzer and you need to achieve the equivalent of `categorization_filters`
(which are not permitted when some other aspect of the analyzer is customized),
add them here as
<<analysis-pattern-replace-charfilter,pattern replace character filters>>.
`tokenizer`::
(string or object) The name or definition of the
{ref}/analysis-tokenizers.html[tokenizer] to use after character
filters have been applied. This property is compulsory if `categorization_analyzer`
is specified as an object. Machine learning provides a tokenizer called `ml_classic`
that tokenizes in the same way as the non-customizable tokenizer in older versions of
the product. If you would like to stick with this but change the character or token
filters then specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
<<analysis-tokenizers,tokenizer>> to use after character filters are applied.
This property is compulsory if `categorization_analyzer` is specified as an
object. Machine learning provides a tokenizer called `ml_classic` that
tokenizes in the same way as the non-customizable tokenizer in older versions
of the product. If you want to use that tokenizer but change the character or
token filters, specify `"tokenizer": "ml_classic"` in your
`categorization_analyzer`.
`filter`::
(array of strings or objects) One or more
{ref}/analysis-tokenfilters.html[token filters]. In addition to the built-in token
filters other plugins may provide more. This property is optional. If not specified
then there will be no token filters.
<<analysis-tokenfilters,token filters>>. In addition to the built-in token
filters, other plugins can provide more token filters. This property is
optional. If it is not specified, no token filters are applied prior to
categorization.
If you omit `categorization_analyzer` entirely then the default that will be used is
the one from the following job:
If you omit the `categorization_analyzer`, the following default values are used:
[source,js]
--------------------------------------------------
@@ -379,27 +379,28 @@ POST _xpack/ml/anomaly_detectors/_validate
--------------------------------------------------
// CONSOLE
However, if you specify any part of `categorization_analyzer` then any omitted
sub-properties are _not_ defaulted.
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.
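For example, in this sketch (job details illustrative) only the tokenizer is
specified, so no character filters or token filters are applied at all:
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "ml_classic" <1>
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
--------------------------------------------------
// CONSOLE
<1> Because `char_filter` and `filter` are omitted, the default day and month
stopword filter is _not_ added back in.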
If you are categorizing non-English messages in a language where words are separated
by spaces you may get better results if you change the day/month words in the stop
token filter to those from your language. If you are categorizing messages in a language
where words are not separated by spaces then you will need to use a different tokenizer
as well in order to get sensible categorization results.
If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
words in the stop token filter to the appropriate words in your language. If you
are categorizing messages in a language where words are not separated by spaces,
you must use a different tokenizer as well in order to get sensible
categorization results.
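For instance, a sketch for Japanese log messages, assuming the
`analysis-kuromoji` plugin is installed (field names are illustrative):
[source,js]
--------------------------------------------------
POST _xpack/ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_field_name" : "message",
    "categorization_analyzer" : {
      "tokenizer" : "kuromoji_tokenizer" <1>
    },
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory"
    }]
  },
  "data_description" : {
    "time_field" : "timestamp"
  }
}
--------------------------------------------------
// CONSOLE
<1> `kuromoji_tokenizer` is provided by the `analysis-kuromoji` plugin; it is
an assumption of this sketch, not part of the examples above.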
It is important to be aware that analyzing for categorization of machine generated
log messages is a little different to tokenizing for search. Features that work well
for search, such as stemming, synonym substitution and lowercasing are likely to make
the results of categorization worse. However, in order for drilldown from machine
learning results to work correctly, the tokens that the categorization analyzer
produces need to be sufficiently similar to those produced by the search analyzer
that if you search for the tokens that the categorization analyzer produces you will
find the original document that the field to be categorized came from.
For more information, see {xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
//<<ml-configuring-categories>>.
It is important to be aware that analyzing for categorization of machine
generated log messages is a little different from tokenizing for search.
Features that work well for search, such as stemming, synonym substitution, and
lowercasing are likely to make the results of categorization worse. However, in
order for drill down from {ml} results to work correctly, the tokens that the
categorization analyzer produces must be similar to those produced by the search
analyzer. If they are sufficiently similar, when you search for the tokens that
the categorization analyzer produces then you find the original document that
the categorization field value came from.
For more information, see
{xpack-ref}/ml-configuring-categories.html[Categorizing Log Messages].
[float]
[[ml-apilimits]]