[DOCS] Adds text about data types to the categorization docs (#51145)

István Zoltán Szabó 2020-01-17 18:52:57 +01:00 committed by lcawl
parent ccf3e443b5
commit 83c92cf7eb
1 changed file with 37 additions and 14 deletions


@@ -1,6 +1,28 @@
[role="xpack"] [role="xpack"]
[[ml-configuring-categories]] [[ml-configuring-categories]]
=== Categorizing log messages === Categorizing data
Categorization is a {ml} process that tokenizes a text field, clusters similar
data together, and classifies it into categories. However, categorization
doesn't work equally well on different data types. It works best on
machine-written messages and application output, typically data that consists
of repeated elements, such as log messages written for system troubleshooting.
Log categorization groups unstructured log messages into categories; you can
then use {anomaly-detect} to model and identify rare or unusual counts of log
message categories.
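
In practice, such a job pairs `categorization_field_name` with a `count`
detector split by the built-in `mlcategory` field. A minimal sketch of the
relevant `analysis_config` fragment (the field name and bucket span here are
illustrative assumptions, not values from these docs):

[source,js]
----
"analysis_config": {
  "bucket_span": "30m",
  "categorization_field_name": "message",
  "detectors": [{
    "function": "count",
    "by_field_name": "mlcategory"
  }]
}
----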
Categorization is tuned to work best on data like log messages: it takes token
order into account, includes stop words in its analysis, and does not consider
synonyms. Complete sentences in human communication or literary text (for
example, emails, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results on such human-generated data; for example, a
categorization job would create so many categories that they couldn't be
handled effectively. Categorization is _not_ natural language processing (NLP).

[float]
[[ml-categorization-log-messages]]
==== Categorizing log messages

Application log events are often unstructured and contain variable data. For
example:
@@ -65,8 +87,8 @@ defining categories. The categorization filters are applied in the order they
are listed in the job configuration, which allows you to disregard multiple
sections of the categorization field value. In this example, we have decided that
we do not want the detailed SQL to be considered in the message categorization.
This particular categorization filter removes the SQL statement from the
categorization algorithm.

If your data is stored in {es}, you can create an advanced {anomaly-job} with
these same properties:
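
The full request body falls outside this hunk; a minimal sketch of such an
advanced job, with a hypothetical job name and filter pattern, might look like:

[source,console]
----
PUT _ml/anomaly_detectors/it_ops_logs_example
{
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[SQL: .*\\]" ], <1>
    "detectors": [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "analysis_limits": {
    "categorization_examples_limit": 5 <2>
  },
  "data_description": {
    "time_field": "time"
  }
}
----
<1> A hypothetical filter that strips an embedded SQL statement before tokenization.
<2> The `categorization_examples_limit` property referred to in the note below.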
@@ -79,7 +101,7 @@ NOTE: To add the `categorization_examples_limit` property, you must use the

[float]
[[ml-configuring-analyzer]]
===== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
@@ -135,7 +157,8 @@ here achieves exactly the same as the `categorization_filters` in the first
example.
<2> The `ml_classic` tokenizer works like the non-customizable tokenization
that was used for categorization in older versions of machine learning. If you
want the same categorization behavior as older versions, use this property
value.
<3> By default, English day or month words are filtered from log messages before
categorization. If your logs are in a different language and contain
dates, you might get better results by filtering the day or month words in your
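
The configuration block these callouts annotate lies outside this hunk; a
sketch of a `categorization_analyzer` along those lines (the job name is
hypothetical, and the stopword list is abbreviated to weekdays for brevity)
could be:

[source,console]
----
PUT _ml/anomaly_detectors/it_ops_analyzer_example
{
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "char_filter": [
        { "type": "pattern_replace", "pattern": "\\[SQL: .*\\]" }
      ],
      "tokenizer": "ml_classic",
      "filter": [
        { "type": "stop", "stopwords": [ "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday", "Sunday" ] }
      ]
    },
    "detectors": [{
      "function": "count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description": {
    "time_field": "time"
  }
}
----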
@@ -178,9 +201,9 @@ POST _ml/anomaly_detectors/_validate

If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

[source,console]
@@ -234,11 +257,11 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
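
The analyzer these callouts annotate also falls outside this hunk; the trimming
that <4> and <5> describe can be expressed with `pattern_replace` token
filters, roughly as follows (the exact patterns are an illustrative guess, not
the ones in the docs):

[source,js]
----
"filter": [
  { "type": "pattern_replace", "pattern": "^[_\\-\\.]+", "replacement": "" },
  { "type": "pattern_replace", "pattern": "[_\\-\\.]+$", "replacement": "" }
]
----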

The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not include
accented letters in tokens whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
@@ -263,7 +286,7 @@ API examples above.

[float]
[[ml-viewing-categories]]
===== Viewing categorization results

After you open the job and start the {dfeed} or supply data to the job, you can
view the categorization results in {kib}. For example:
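
(The {kib} example that follows is outside this hunk.) Categorization results
can also be fetched directly with the get categories API; for example, for a
hypothetical job named `it_ops_logs_example`:

[source,console]
----
GET _ml/anomaly_detectors/it_ops_logs_example/results/categories
{
  "page": { "size": 10 }
}
----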