[DOCS] Updates categorization examples with wizard screenshots (#51133)
This commit is contained in:
parent 83647101ef
commit ec47698f7c

@@ -1,174 +1,131 @@
[role="xpack"]
[[ml-configuring-categories]]
=== Detecting anomalous categories of data

Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. It works best on
machine-written messages and application output that typically consist of
repeated elements. For example, it works well on logs that contain a finite set
of possible messages:

//Obtained from it_ops_new_app_logs.json
[source,js]
----------------------------------
{"@timestamp":1549596476000,
"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
"type":"logs"}
----------------------------------
//NOTCONSOLE

Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
example email, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data. It would create so many
categories that they couldn't be handled effectively. Categorization is _not_
natural language processing (NLP).

When you create a categorization {anomaly-job}, the {ml} model learns what
volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.
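
For instance, a minimal sketch of the same job using the rare function instead
of the count function, so that categories of messages that almost never occur
are surfaced (the job name `it_ops_rare_logs` is hypothetical):

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_rare_logs
{
  "description" : "IT ops application logs, rare categories",
  "analysis_config" : {
    "categorization_field_name": "message",
    "bucket_span":"30m",
    "detectors" :[{
      "function":"rare", <1>
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]

<1> The rare function flags message categories that have rarely occurred
before, rather than counting occurrences per category in each bucket.
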
In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:

[role="screenshot"]
|
||||
image::images/ml-category-wizard.jpg["Creating a categorization job in Kibana"]
|
||||
|
||||
[%collapsible]
|
||||
.API example
|
||||
====
|
||||
[source,console]
|
||||
----------------------------------
|
||||
PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "description" : "IT ops application logs",
  "analysis_config" : {
    "categorization_field_name": "message",<1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"<2>
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]

<1> This field is used to derive categories.
<2> The categories are used in a detector by setting `by_field_name`,
`over_field_name`, or `partition_field_name` to the keyword `mlcategory`. If you
do not specify this keyword in one of those properties, the API request fails.
====

You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]

For this type of job, the results contain extra information for each anomaly:
the name of the category (for example, `mlcategory 2`) and examples of the
messages in that category. You can use these details to investigate occurrences
of unusually high message counts.
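
After the job has processed data, you can also retrieve the categories it
derived through the API. A minimal sketch, assuming the job from the example
above:

[source,console]
----------------------------------
GET _ml/anomaly_detectors/it_ops_app_logs/results/categories
----------------------------------
// TEST[skip:needs-licence]

The response lists each category together with the regular expression and
example messages that define it.
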
If you use the advanced {anomaly-job} wizard in {kib} or the
{ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional
configuration options. For example, the optional `categorization_examples_limit`
property specifies the maximum number of examples that are stored in memory and
in the results data store for each category. The default value is `4`. Note that
this setting does not affect the categorization; it just affects the list of
visible examples. If you increase this value, more examples are available, but
you must have more storage available. If you set this value to `0`, no examples
are stored.
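
For instance, a sketch of the `analysis_limits` object that raises the limit to
five stored examples per category (the value `5` is illustrative; the object
sits alongside `analysis_config` in the job definition):

[source,js]
----------------------------------
"analysis_limits" : {
  "categorization_examples_limit": 5
}
----------------------------------
//NOTCONSOLE
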
Another advanced option is the `categorization_filters` property, which can
contain an array of regular expressions. If a categorization field value matches
the regular expression, the portion of the field that is matched is not taken
into consideration when defining categories. The categorization filters are
applied in the order they are listed in the job configuration, which enables you
to disregard multiple sections of the categorization field value. In this
example, you might create a filter like `[ "\\[statement:.*\\]"]` to remove the
SQL statement from the categorization algorithm.
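
A sketch of how such a filter fits into the job definition (only the relevant
part of `analysis_config` is shown):

[source,js]
----------------------------------
"analysis_config" : {
  "categorization_field_name": "message",
  "categorization_filters": [ "\\[statement:.*\\]" ]
}
----------------------------------
//NOTCONSOLE
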
[discrete]
[[ml-configuring-analyzer]]
==== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies. You
can also change the tokenization rules by customizing the way the categorization
field values are interpreted:

[role="screenshot"]
image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in Kibana"]

The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
see the
{ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].
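
For instance, a sketch of a custom `categorization_analyzer` that reproduces the
SQL-statement filter from the earlier example as a `pattern_replace` character
filter in front of the `ml_classic` tokenizer (treat the exact combination as
illustrative):

[source,js]
----------------------------------
"categorization_analyzer" : {
  "char_filter": [
    { "type": "pattern_replace", "pattern": "\\[statement:.*\\]" }
  ],
  "tokenizer": "ml_classic"
}
----------------------------------
//NOTCONSOLE
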
If you use the default categorization analyzer in {kib} or omit the
`categorization_analyzer` property from the API, the following default values
are used:

[source,console]
--------------------------------------------------

@@ -279,23 +236,3 @@ categorization analyzer produces must be similar to those produced by the search
analyzer. If they are sufficiently similar, when you search for the tokens that
the categorization analyzer produces then you find the original document that
the categorization field value came from.
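
To see which tokens an analyzer produces for a sample message, you can use the
{ref}/indices-analyze.html[analyze API]. A minimal sketch, using the `standard`
analyzer as a stand-in for whatever search analyzer your index actually uses:

[source,console]
----------------------------------
GET _analyze
{
  "analyzer": "standard",
  "text": "Statement cancelled due to timeout or client request"
}
----------------------------------

Comparing this output with the tokens that the categorization analyzer produces
shows whether searching for those tokens would match the original message.
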
Binary file not shown. After: 760 KiB
Binary file not shown. Before: 347 KiB, After: 370 KiB
Binary file not shown. After: 440 KiB