From 83c92cf7ebee262876fbd8e409f01420194d7c57 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Fri, 17 Jan 2020 18:52:57 +0100
Subject: [PATCH] [DOCS] Adds text about data types to the categorization docs
 (#51145)

---
 .../ml/anomaly-detection/categories.asciidoc | 51 ++++++++++++++----
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/docs/reference/ml/anomaly-detection/categories.asciidoc b/docs/reference/ml/anomaly-detection/categories.asciidoc
index 79c34950915..e159b46cc2b 100644
--- a/docs/reference/ml/anomaly-detection/categories.asciidoc
+++ b/docs/reference/ml/anomaly-detection/categories.asciidoc
@@ -1,6 +1,28 @@
 [role="xpack"]
 [[ml-configuring-categories]]
-=== Categorizing log messages
+=== Categorizing data
+
+Categorization is a {ml} process that tokenizes a text field, clusters similar
+data together, and classifies it into categories. Categorization does not work
+equally well on all data types, however. It works best on machine-written
+messages and application output, typically data that consists of repeated
+elements, such as log messages generated for system troubleshooting. Log
+categorization groups unstructured log messages into categories; you can then
+use {anomaly-detect} to model and identify rare or unusual counts of log
+message categories.
+
+Categorization is tuned to work best on data like log messages: it takes token
+order into account, does not consider synonyms, and includes stop words in its
+analysis. Complete sentences in human communication or literary text (for
+example emails, wiki pages, prose, or other human-generated content) can be
+extremely diverse in structure. Because categorization is tuned for machine
+data, it gives poor results on such human-generated data; for example, a
+categorization job would create so many categories that they could not be
+handled effectively. Categorization is _not_ natural language processing (NLP).
+
+[float]
+[[ml-categorization-log-messages]]
+==== Categorizing log messages
 
 Application log events are often unstructured and contain variable data. For
 example:
@@ -65,8 +87,8 @@ defining categories.
 The categorization filters are applied in the order they are listed in the job
 configuration, which allows you to disregard multiple sections of the
 categorization field value. In this example, we have decided that we do not
 want the detailed SQL to be considered in the message categorization.
-This particular categorization filter removes the SQL statement from the categorization
-algorithm.
+This particular categorization filter removes the SQL statement from the
+categorization algorithm.
 
 If your data is stored in {es}, you can create an advanced {anomaly-job} with
 these same properties:
@@ -79,7 +101,7 @@ NOTE: To add the `categorization_examples_limit` property, you must use the
 
 [float]
 [[ml-configuring-analyzer]]
-==== Customizing the categorization analyzer
+===== Customizing the categorization analyzer
 
 Categorization uses English dictionary words to identify log message categories.
 By default, it also uses English tokenization rules. For this reason, if you use
@@ -135,7 +157,8 @@ here achieves exactly the same as the `categorization_filters` in the first
 example.
 <2> The `ml_classic` tokenizer works like the non-customizable tokenization
 that was used for categorization in older versions of machine learning. If you
-want the same categorization behavior as older versions, use this property value.
+want the same categorization behavior as older versions, use this property
+value.
 <3> By default, English day or month words are filtered from log messages before
 categorization.
 If your logs are in a different language and contain dates, you might get
 better results by filtering the day or month words in your
@@ -178,9 +201,9 @@ POST _ml/anomaly_detectors/_validate
 If you specify any part of the `categorization_analyzer`, however, any omitted
 sub-properties are _not_ set to default values.
 
-The `ml_classic` tokenizer and the day and month stopword filter are more or less
-equivalent to the following analyzer, which is defined using only built-in {es}
-{ref}/analysis-tokenizers.html[tokenizers] and
+The `ml_classic` tokenizer and the day and month stopword filter are more or
+less equivalent to the following analyzer, which is defined using only built-in
+{es} {ref}/analysis-tokenizers.html[tokenizers] and
 {ref}/analysis-tokenfilters.html[token filters]:
 
 [source,console]
@@ -234,11 +257,11 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
 <4> Underscores, hyphens, and dots are removed from the beginning of tokens.
 <5> Underscores, hyphens, and dots are also removed from the end of tokens.
 
-The key difference between the default `categorization_analyzer` and this example
-analyzer is that using the `ml_classic` tokenizer is several times faster. The
-difference in behavior is that this custom analyzer does not include accented
-letters in tokens whereas the `ml_classic` tokenizer does, although that could
-be fixed by using more complex regular expressions.
+The key difference between the default `categorization_analyzer` and this
+example analyzer is that using the `ml_classic` tokenizer is several times
+faster. The difference in behavior is that this custom analyzer does not include
+accented letters in tokens whereas the `ml_classic` tokenizer does, although
+that could be fixed by using more complex regular expressions.
 
 If you are categorizing non-English messages in a language where words are
 separated by spaces, you might get better results if you change the day or month
@@ -263,7 +286,7 @@ API examples above.
 
 [float]
 [[ml-viewing-categories]]
-==== Viewing categorization results
+===== Viewing categorization results
 
 After you open the job and start the {dfeed} or supply data to the job, you
 can view the categorization results in {kib}. For example:
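
The patched docs discuss `categorization_filters` and per-category counting in
separate hunks. As a sketch only, a minimal advanced {anomaly-job} that puts
those pieces together might look like the following; the job ID, field names,
and filter pattern here are illustrative assumptions, not content from the
patch:

[source,console]
----
PUT _ml/anomaly_detectors/it_ops_app_logs <1>
{
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message", <2>
    "categorization_filters": [ "\\[statement:.*\\]" ], <3>
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory" <4>
      }
    ]
  },
  "data_description": {
    "time_field": "time" <5>
  }
}
----
<1> A hypothetical job ID chosen for this sketch.
<2> Assumes the unstructured log text is stored in a field named `message`.
<3> A hypothetical filter that removes a bracketed SQL statement from
consideration before categorization, in the spirit of the SQL example in the
patched docs.
<4> Counting by `mlcategory` models the rate of each message category, so
{anomaly-detect} can flag rare or unusual categories.
<5> The time field name is an assumption about the source data.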