[[ml-configuring-categories]]
=== Categorizing log messages
Application log events are often unstructured and contain variable data. For
example:

//Obtained from it_ops_new_app_logs.json
[source,js]
----------------------------------
{"time":1454516381000,"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]","type":"logs"}
----------------------------------
//NOTCONSOLE

You can use {ml} to observe the static parts of the message, cluster similar
messages together, and classify them into message categories.

NOTE: Categorization uses English tokenization rules and dictionary words to
identify log message categories. As such, only English language log messages
are supported.

The {ml} model learns what volume and pattern is normal for each category over
time. You can then detect anomalies and surface rare events or unusual types of
messages by using count or rare functions. For example:

//Obtained from it_ops_new_app_logs.sh
[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message", <1>
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory", <2>
      "detector_description": "Unusual message counts"
    }],
    "categorization_filters": [ "\\[statement:.*\\]" ]
  },
  "analysis_limits": {
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field": "time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> The `categorization_field_name` property indicates which field will be
categorized.
<2> The resulting categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.
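
For example, to surface rare message categories rather than unusually high
message counts, you can point the `rare` function at `mlcategory`. A minimal
sketch (the job name `it_ops_rare_logs` is illustrative):

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_rare_logs
{
  "description" : "Rare IT Ops log message categories",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors" : [{
      "function": "rare",
      "by_field_name": "mlcategory",
      "detector_description": "Rare message categories"
    }]
  },
  "data_description" : {
    "time_field": "time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//NOTCONSOLE
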
The optional `categorization_examples_limit` property specifies the
maximum number of examples that are stored in memory and in the results data
store for each category. The default value is `4`. Note that this setting does
not affect the categorization; it just affects the list of visible examples. If
you increase this value, more examples are available, but you must have more
storage available. If you set this value to `0`, no examples are stored.
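
To see which examples were kept for each category, you can page through them
with the get categories API. A minimal sketch against the job defined above:

[source,js]
----------------------------------
GET _xpack/ml/anomaly_detectors/it_ops_new_logs/results/categories
{
  "page": { "size": 10 }
}
----------------------------------
//NOTCONSOLE
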

The optional `categorization_filters` property can contain an array of regular
expressions. If a categorization field value matches a regular expression, the
matched portion of the field is not taken into consideration when defining
categories. The categorization filters are applied in the order they are listed
in the job configuration, which allows you to disregard multiple sections of the
categorization field value. In this example, the single filter
`\[statement:.*\]` removes the detailed SQL statement from the field value so
that it is not considered by the categorization algorithm.
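
For example, a hypothetical configuration that chains two filters to strip both
the SQL statement and any ISO-style dates before categorization might look like
this fragment:

[source,js]
----------------------------------
"categorization_filters": [
  "\\[statement:.*\\]",
  "\\d{4}-\\d{2}-\\d{2}"
]
----------------------------------
//NOTCONSOLE
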

If your data is stored in {es}, you can create an advanced job with these same
properties:

[role="screenshot"]
image::images/ml-category-advanced.jpg["Advanced job configuration options related to categorization"]

NOTE: To add the `categorization_examples_limit` property, you must use the
**Edit JSON** tab and copy the `analysis_limits` object from the API example.

It is possible to customize the way the categorization field values are
interpreted to an even greater extent:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs2
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer": {
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" } <1>
      ],
      "tokenizer": "ml_classic", <2>
      "filter": [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] } <3>
      ]
    }
  },
  "analysis_limits": {
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field": "time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> The
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. Use
it if you want the same categorization behavior as older versions.
<3> By default, English day and month words are filtered from log messages
before categorization. If your logs are in a different language and contain
dates, you may get better results by filtering the day and month words in your
language instead, as shown in the fragment below.
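
For example, if your logs contained French dates, a stop filter along these
lines (a hypothetical fragment; extend it with abbreviated forms as needed)
could replace the English one:

[source,js]
----------------------------------
{ "type" : "stop", "stopwords": [
  "lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche",
  "janvier", "février", "mars", "avril", "mai", "juin",
  "juillet", "août", "septembre", "octobre", "novembre", "décembre"
] }
----------------------------------
//NOTCONSOLE
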

The optional `categorization_analyzer` property allows even greater
customization of how categorization interprets the categorization field value.
It can refer to a built-in {es} analyzer or to a combination of zero or more
character filters, a tokenizer, and zero or more token filters.
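
For example, to refer to a built-in analyzer by name rather than assembling the
parts yourself, the property can be a plain string. A minimal fragment (the
`standard` analyzer is only an illustration; it does none of the filtering
described above, so it is rarely a good choice for log categorization):

[source,js]
----------------------------------
"categorization_analyzer": "standard"
----------------------------------
//NOTCONSOLE
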

The `ml_classic` tokenizer and the day/month stopword filter are more or less
equivalent to the following analyzer, which is defined using only built-in {es}
{ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/it_ops_new_logs3
{
  "description" : "IT Ops Application Logs",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span": "30m",
    "detectors" : [{
      "function": "count",
      "by_field_name": "mlcategory",
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer": {
      "tokenizer": {
        "type" : "simple_pattern_split",
        "pattern" : "[^-0-9A-Za-z_.]+" <1>
      },
      "filter": [
        { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
        { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
        { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
        { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
          "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
          "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "GMT", "UTC"
        ] }
      ]
    }
  },
  "analysis_limits": {
    "categorization_examples_limit": 5
  },
  "data_description" : {
    "time_field": "time",
    "time_format": "epoch_ms"
  }
}
----------------------------------
//CONSOLE
<1> Tokens consist of hyphens, digits, letters, underscores, and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. (The difference in behavior is that the custom analyzer does not
include accented letters in tokens, whereas the `ml_classic` tokenizer does,
although that could be fixed by using more complex regular expressions.)
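
If you want to check how a custom tokenizer splits a sample message before you
commit to it in a job, you can try it with the standard {es} `_analyze` API. A
minimal sketch using the tokenizer from the example above:

[source,js]
----------------------------------
POST _analyze
{
  "tokenizer": {
    "type" : "simple_pattern_split",
    "pattern" : "[^-0-9A-Za-z_.]+"
  },
  "text": "Statement cancelled due to timeout or client request"
}
----------------------------------
//NOTCONSOLE
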
NOTE: To add the `categorization_analyzer` property, you must use the
**Edit JSON** tab and copy the `categorization_analyzer` object from one of the
API examples above.

After you open the job and start the {dfeed} or supply data to the job, you can
view the results in {kib}. For example:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization example in the Anomaly Explorer"]
For this type of job, the **Anomaly Explorer** contains extra information for
each anomaly: the name of the category (for example, `mlcategory 11`) and
examples of the messages in that category. In this case, you can use these
details to investigate occurrences of unusually high message counts for specific
message categories.
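
If you prefer the API to {kib}, you can also retrieve the definition and stored
examples of a single category directly. A minimal sketch, assuming the category
ID `11` from the example above:

[source,js]
----------------------------------
GET _xpack/ml/anomaly_detectors/it_ops_new_logs/results/categories/11
----------------------------------
//NOTCONSOLE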