[DOCS] Updates categorization examples with wizard screenshots (#51133)

2020-01-22 11:26:10 -08:00 · 2020-01-22 11:26:10 -08:00 · ec47698f7c
parent 83647101ef
commit ec47698f7c
4 changed files with 85 additions and 148 deletions
--- a/docs/reference/ml/anomaly-detection/categories.asciidoc
+++ b/docs/reference/ml/anomaly-detection/categories.asciidoc
@@ -1,174 +1,131 @@
 [role="xpack"]
 [[ml-configuring-categories]]
-=== Categorizing data
+=== Detecting anomalous categories of data
-Categorization is a {ml} process that considers a tokenization of a field, 
+Categorization is a {ml} process that tokenizes a text field, clusters similar
-clusters similar data together, and classifies them into categories. However, 
+data together, and classifies it into categories. It works best on
-categorization doesn't work equally well on different data types. It works 
+machine-written messages and application output that typically consist of
-best on machine-written messages and application outputs, typically on data that 
+repeated elements. For example, it works well on logs that contain a finite set
-consists of repeated elements, for example log messages for the purpose of 
+of possible messages:
 system troubleshooting. Log categorization groups unstructured log messages into 
 categories, then you can use {anomaly-detect} to model and identify rare or 
 unusual counts of log message categories.
 Categorization is tuned to work best on data like log messages by taking token
 order into account, not considering synonyms, and including stop words in its 
 analysis. Complete sentences in human communication or literary text (for 
 example emails, wiki pages, prose, or other human generated content) can be 
 extremely diverse in structure.  Since categorization is tuned for machine data 
 it will give poor results on such human generated data. For example, the 
 categorization job would create so many categories that couldn't be handled 
 effectively.  Categorization is _not_ natural language processing (NLP).
 [float]
 [[ml-categorization-log-messages]]
 ==== Categorizing log messages
 Application log events are often unstructured and contain variable data. For
 example:
 //Obtained from it_ops_new_app_logs.json
 [source,js]
 ----------------------------------
-{"time":1454516381000,"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]","type":"logs"}
+{"@timestamp":1549596476000,
 "message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
 "type":"logs"}
 ----------------------------------
 //NOTCONSOLE
-You can use {ml} to observe the static parts of the message, cluster similar
+Categorization is tuned to work best on data like log messages by taking token
-messages together, and classify them into message categories.
+order into account, including stop words, and not considering synonyms in its
 analysis. Complete sentences in human communication or literary text (for
 example email, wiki pages, prose, or other human-generated content) can be 
 extremely diverse in structure. Since categorization is tuned for machine data, 
 it gives poor results for human-generated data. It would create so many
 categories that they couldn't be handled effectively. Categorization is _not_
 natural language processing (NLP).
-The {ml} model learns what volume and pattern is normal for each category over
+When you create a categorization {anomaly-job}, the {ml} model learns what
-time. You can then detect anomalies and surface rare events or unusual types of
+volume and pattern is normal for each category over time. You can then detect
-messages by using count or rare functions. For example:
+anomalies and surface rare events or unusual types of messages by using
 <<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.
-//Obtained from it_ops_new_app_logs.sh
+In {kib}, there is a categorization wizard to help you create this type of 
 {anomaly-job}. For example, the following job generates categories from the
 contents of the `message` field and uses the count function to determine when
 certain categories are occurring at anomalous rates:
 [role="screenshot"]
 image::images/ml-category-wizard.jpg["Creating a categorization job in Kibana"]
 [%collapsible]
 .API example
 ====
 [source,console]
 ----------------------------------
-PUT _ml/anomaly_detectors/it_ops_new_logs
+PUT _ml/anomaly_detectors/it_ops_app_logs
 {
-  "description" : "IT Ops Application Logs",
+  "description" : "IT ops application logs",
  "analysis_config" : {
-    "categorization_field_name": "message", <1>
+    "categorization_field_name": "message",<1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
-      "by_field_name": "mlcategory", <2>
+      "by_field_name": "mlcategory"<2>
-      "detector_description": "Unusual message counts"
+    }]
    }],
    "categorization_filters":[ "\\[statement:.*\\]"]
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
-    "time_field":"time",
+    "time_field":"@timestamp"
    "time_format": "epoch_ms"
  }
 }
 ----------------------------------
 // TEST[skip:needs-licence]
-
+<1> This field is used to derive categories.
-<1> The `categorization_field_name` property indicates which field will be
+<2> The categories are used in a detector by setting `by_field_name`,
 categorized.
 <2> The resulting categories are used in a detector by setting `by_field_name`,
 `over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
 do not specify this keyword in one of those properties, the API request fails.
 ====
 The optional `categorization_examples_limit` property specifies the
 maximum number of examples that are stored in memory and in the results data
 store for each category. The default value is `4`. Note that this setting does
 not affect the categorization; it just affects the list of visible examples. If
 you increase this value, more examples are available, but you must have more
 storage available. If you set this value to `0`, no examples are stored.
-The optional `categorization_filters` property can contain an array of regular
+You can use the **Anomaly Explorer** in {kib} to view the analysis results: 
 expressions. If a categorization field value matches the regular expression, the
 portion of the field that is matched is not taken into consideration when
 defining categories. The categorization filters are applied in the order they
 are listed in the job configuration, which allows you to disregard multiple
 sections of the categorization field value. In this example, we have decided that
 we do not want the detailed SQL to be considered in the message categorization.
 This particular categorization filter removes the SQL statement from the 
 categorization algorithm.
 If your data is stored in {es}, you can create an advanced {anomaly-job} with
 these same properties:
 [role="screenshot"]
-image::images/ml-category-advanced.jpg["Advanced job configuration options related to categorization"]
+image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]
-NOTE: To add the `categorization_examples_limit` property, you must use the
+For this type of job, the results contain extra information for each anomaly:
-**Edit JSON** tab and copy the `analysis_limits` object from the API example.
+the name of the category (for example, `mlcategory 2`) and examples of the
 messages in that category. You can use these details to investigate occurrences
 of unusually high message counts.
-[float]
+If you use the advanced {anomaly-job} wizard in {kib} or the
 {ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional
 configuration options. For example, the optional `categorization_examples_limit`
 property specifies the maximum number of examples that are stored in memory and
 in the results data store for each category. The default value is `4`. Note that
 this setting does not affect the categorization; it just affects the list of
 visible examples. If you increase this value, more examples are available, but
 you must have more storage available. If you set this value to `0`, no examples
 are stored.
 Another advanced option is the `categorization_filters` property, which can
 contain an array of regular expressions. If a categorization field value matches
 the regular expression, the portion of the field that is matched is not taken
 into consideration when defining categories. The categorization filters are
 applied in the order they are listed in the job configuration, which enables you
 to disregard multiple sections of the categorization field value. In this
 example, you might create a filter like `[ "\\[statement:.*\\]"]` to remove the
 SQL statement from the categorization algorithm.
 [discrete]
 [[ml-configuring-analyzer]]
-===== Customizing the categorization analyzer
+==== Customizing the categorization analyzer
 Categorization uses English dictionary words to identify log message categories.
 By default, it also uses English tokenization rules. For this reason, if you use
 the default categorization analyzer, only English language log messages are
-supported, as described in the <<ml-limitations>>.
+supported, as described in the <<ml-limitations>>. 
-You can, however, change the tokenization rules by customizing the way the
+If you use the categorization wizard in {kib}, you can see which categorization
-categorization field values are interpreted. For example:
+analyzer it uses and highlighted examples of the tokens that it identifies. You
 can also change the tokenization rules by customizing the way the categorization
 field values are interpreted:
-[source,console]
+[role="screenshot"]
----------------------------------
+image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in Kibana"]
 PUT _ml/anomaly_detectors/it_ops_new_logs2
 {
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits":{
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field":"time",
    "time_format": "epoch_ms"
  }
 }
 ----------------------------------
 // TEST[skip:needs-licence]
-<1> The
+The categorization analyzer can refer to a built-in {es} analyzer or a
 combination of zero or more character filters, a tokenizer, and zero or more
 token filters. In this example, adding a 
 {ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
-here achieves exactly the same as the `categorization_filters` in the first
+achieves exactly the same behavior as the `categorization_filters` job
-example.
+configuration option described earlier. For more details about these properties,
-<2> The `ml_classic` tokenizer works like the non-customizable tokenization
+see the
-that was used for categorization in older versions of machine learning. If you
+{ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].
 want the same categorization behavior as older versions, use this property 
 value.
 <3> By default, English day or month words are filtered from log messages before
 categorization. If your logs are in a different language and contain
 dates, you might get better results by filtering the day or month words in your
 language.
-The optional `categorization_analyzer` property allows even greater customization
+If you use the default categorization analyzer in {kib} or omit the
-of how categorization interprets the categorization field value. It can refer to
+`categorization_analyzer` property from the API, the following default values
-a built-in {es} analyzer or a combination of zero or more character filters,
+are used:
 a tokenizer, and zero or more token filters. If you omit the
 `categorization_analyzer`, the following default values are used:
 [source,console]
 --------------------------------------------------
@@ -279,23 +236,3 @@ categorization analyzer produces must be similar to those produced by the search
 analyzer. If they are sufficiently similar, when you search for the tokens that
 the categorization analyzer produces then you find the original document that
 the categorization field value came from.
 NOTE: To add the `categorization_analyzer` property in {kib}, you must use the
 **Edit JSON** tab and copy the `categorization_analyzer` object from one of the
 API examples above.
 [float]
 [[ml-viewing-categories]]
 ===== Viewing categorization results
 After you open the job and start the {dfeed} or supply data to the job, you can
 view the categorization results in {kib}. For example:
 [role="screenshot"]
 image::images/ml-category-anomalies.jpg["Categorization example in the Anomaly Explorer"]
 For this type of job, the **Anomaly Explorer** contains extra information for
 each anomaly: the name of the category (for example, `mlcategory 11`) and
 examples of the messages in that category. In this case, you can use these
 details to investigate occurrences of unusually high message counts for specific
 message categories.
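Reviewer note: the `categorization_filters` behavior described in this file (the matched portion of the field is ignored, and filters are applied in the order listed) can be illustrated with a small sketch. It uses Python's `re` module as a stand-in and is not the {es} implementation; the helper name `apply_categorization_filters` is hypothetical, and the sample message and the filter pattern are copied from the example in the diff.

```python
import re

# Filter copied from the job configuration in the diff. In the JSON body it is
# written with doubled backslashes: "\\[statement:.*\\]".
CATEGORIZATION_FILTERS = [r"\[statement:.*\]"]


def apply_categorization_filters(message, filters=CATEGORIZATION_FILTERS):
    """Drop every matched portion of the message, one filter at a time.

    Hypothetical helper: it mimics the documented behavior with Python
    regular expressions and is only an approximation of what the job does.
    """
    for pattern in filters:  # filters run in the order they are listed
        message = re.sub(pattern, "", message)
    return message


# Sample log message taken from the documentation example.
message = (
    "org.jdbi.v2.exceptions.UnableToExecuteStatementException: "
    "com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled "
    "due to timeout or client request "
    '[statement:"SELECT id, customer_id, name, force_disabled, enabled FROM customers"]'
)

filtered = apply_categorization_filters(message)
print(filtered)
```

With the SQL statement removed, messages that differ only in the statement text present the same tokens to categorization, which is the effect the documentation describes.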
--- a/docs/reference/ml/images/ml-category-analyzer.jpg
+++ b/docs/reference/ml/images/ml-category-analyzer.jpg
--- a/docs/reference/ml/images/ml-category-anomalies.jpg
+++ b/docs/reference/ml/images/ml-category-anomalies.jpg
--- a/docs/reference/ml/images/ml-category-wizard.jpg
+++ b/docs/reference/ml/images/ml-category-wizard.jpg
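Reviewer note: the updated callout text says the API request fails unless `by_field_name`, `over_field_name`, or `partition_field_name` is set to the keyword `mlcategory`. A minimal client-side check for that rule might look like the sketch below. The function name `uses_mlcategory` is hypothetical, and the job body is assembled from the added (`+`) lines of the diff, so it may not match the committed example exactly.

```python
import json

CATEGORY_KEYWORD = "mlcategory"
# Detector properties that may carry the keyword, per the documentation text.
CATEGORY_PROPERTIES = ("by_field_name", "over_field_name", "partition_field_name")


def uses_mlcategory(job_body):
    """Return True if at least one detector references the mlcategory keyword."""
    detectors = job_body.get("analysis_config", {}).get("detectors", [])
    return any(
        detector.get(prop) == CATEGORY_KEYWORD
        for detector in detectors
        for prop in CATEGORY_PROPERTIES
    )


# Assumed request body, pieced together from the added lines in the diff.
job_body = {
    "description": "IT ops application logs",
    "analysis_config": {
        "categorization_field_name": "message",
        "bucket_span": "30m",
        "detectors": [{"function": "count", "by_field_name": "mlcategory"}],
    },
    "data_description": {"time_field": "@timestamp"},
}

print(uses_mlcategory(job_body))
print(json.dumps(job_body, indent=2))
```

A check like this only catches the missing keyword before the request is sent; the server remains the authority on whether the job configuration is valid.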