[DOCS] Improves find_file_structure documentation (#50743)

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2020-01-09 11:19:19 +01:00
parent 4a7e09f624
commit acd73dda1c
1 changed file with 58 additions and 21 deletions
--- a/docs/reference/ml/anomaly-detection/apis/find-file-structure.asciidoc
+++ b/docs/reference/ml/anomaly-detection/apis/find-file-structure.asciidoc
@@ -11,11 +11,13 @@ experimental[]
 Finds the structure of a text file. The text file must contain data that is
 suitable to be ingested into {es}.

+
 [[ml-find-file-structure-request]]
 ==== {api-request-title}

 `POST _ml/find_file_structure`

+
 [[ml-find-file-structure-prereqs]]
 ==== {api-prereq-title}

@@ -23,6 +25,7 @@ suitable to be ingested into {es}.
 `monitor` cluster privileges to use this API. See
 <<security-privileges>>.

+
 [[ml-find-file-structure-desc]]
 ==== {api-description-title}

@@ -55,41 +58,42 @@ specify the `explain` query parameter. It causes an `explanation` to appear in
 the response, which should help in determining why the returned structure was
 chosen.

+
 [[ml-find-file-structure-query-parms]]
 ==== {api-query-parms-title}

 `charset`::
-  (string) Optional. The file's character set. It must be a character set that
+  (Optional, string) The file's character set. It must be a character set that
  is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
  `windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
  finder chooses an appropriate character set.

 `column_names`::
-  (string) Optional. If you have set `format` to `delimited`, you can specify
+  (Optional, string) If you have set `format` to `delimited`, you can specify
  the column names in a comma-separated list. If this parameter is not specified,
  the structure finder uses the column names from the header row of the file. If
   the file does not have a header row, columns are named "column1", "column2",
  "column3", etc.

 `delimiter`::
-  (string) Optional. If you have set `format` to `delimited`, you can specify
+  (Optional, string) If you have set `format` to `delimited`, you can specify
  the character used to delimit the values in each row. Only a single character
  is supported; the delimiter cannot have multiple characters. If this parameter
  is not specified, the structure finder considers the following possibilities:
  comma, tab, semi-colon, and pipe (`|`).

 `explain`::
-  (boolean) Optional. If this parameter is set to `true`, the response includes
+  (Optional, boolean) If this parameter is set to `true`, the response includes
  a field named `explanation`, which is an array of strings that indicate how
  the structure finder produced its result. The default value is `false`.

 `format`::
-  (string) Optional. The high level structure of the file. Valid values are
+  (Optional, string) The high level structure of the file. Valid values are
  `ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is
  not specified, the structure finder chooses one.

 `grok_pattern`::
-  (string) Optional. If you have set `format` to `semi_structured_text`, you can
+  (Optional, string) If you have set `format` to `semi_structured_text`, you can
  specify a Grok pattern that is used to extract fields from every message in
  the file. The name of the timestamp field in the Grok pattern must match what
  is specified in the `timestamp_field` parameter. If that parameter is not
@@ -98,20 +102,20 @@ chosen.
  a Grok pattern.

 `has_header_row`::
-  (boolean) Optional. If you have set `format` to `delimited`, you can use this
+  (Optional, boolean) If you have set `format` to `delimited`, you can use this
  parameter to indicate whether the column names are in the first row of the
  file. If this parameter is not specified, the structure finder guesses based
  on the similarity of the first row of the file to other rows.

 `line_merge_size_limit`::
-  (unsigned integer) Optional. The maximum number of characters in a message
+  (Optional, unsigned integer) The maximum number of characters in a message
  when lines are merged to form messages while analyzing semi-structured files.
  The default is `10000`. If you have extremely long messages you may need to
  increase this, but be aware that this may lead to very long processing times
  if the way to group lines into messages is misdetected.

 `lines_to_sample`::
-  (unsigned integer) Optional. The number of lines to include in the structural
+  (Optional, unsigned integer) The number of lines to include in the structural
  analysis, starting from the beginning of the file. The minimum is 2; the
  default is `1000`. If the value of this parameter is greater than the number
  of lines in the file, the analysis proceeds (as long as there are at least two
@@ -127,7 +131,7 @@ to request analysis of 100000 lines to achieve some variety.
 --

 `quote`::
-  (string) Optional. If you have set `format` to `delimited`, you can specify
+  (Optional, string) If you have set `format` to `delimited`, you can specify
  the character used to quote the values in each row if they contain newlines or
  the delimiter character. Only a single character is supported. If this
  parameter is not specified, the default value is a double quote (`"`). If your
@@ -135,18 +139,18 @@ to request analysis of 100000 lines to achieve some variety.
  argument to a character that does not appear anywhere in the sample.

 `should_trim_fields`::
-  (boolean) Optional. If you have set `format` to `delimited`, you can specify
+  (Optional, boolean) If you have set `format` to `delimited`, you can specify
  whether values between delimiters should have whitespace trimmed from them. If
  this parameter is not specified and the delimiter is pipe (`|`), the default
  value is `true`. Otherwise, the default value is `false`.

 `timeout`::
-  (time) Optional. Sets the maximum amount of time that the structure analysis
-  make take. If the analysis is still running when the timeout expires then it
-  will be aborted. The default value is 25 seconds.
+  (Optional, <<time-units,time units>>) Sets the maximum amount of time that the 
+  structure analysis may take. If the analysis is still running when the
+  timeout expires then it will be aborted. The default value is 25 seconds.

 `timestamp_field`::
-  (string) Optional. The name of the field that contains the primary timestamp
+  (Optional, string) The name of the field that contains the primary timestamp
  of each record in the file. In particular, if the file were ingested into an
  index, this is the field that would be used to populate the `@timestamp` field.
 +
@@ -159,16 +163,16 @@ also specified.
 For structured file formats, if you specify this parameter, the field must exist
 within the file.

-If this parameter is not specified, the structure finder makes a decision about which
-field (if any) is the primary timestamp field. For structured file formats, it
-is not compulsory to have a timestamp in the file.
+If this parameter is not specified, the structure finder makes a decision about 
+which field (if any) is the primary timestamp field. For structured file 
+formats, it is not compulsory to have a timestamp in the file.
 --

 `timestamp_format`::
-  (string) Optional. The Java time format of the timestamp field in the file. +
+  (Optional, string) The Java time format of the timestamp field in the file.
 +
 --
-NOTE: Only a subset of Java time format letter groups are supported:
+Only a subset of Java time format letter groups are supported:

 * `a`
 * `d`
@@ -206,6 +210,20 @@ structure finder does not consider by default.
 If this parameter is not specified, the structure finder chooses the best
 format from a built-in set.

+The following table provides the appropriate `timestamp_format` values for some example timestamps:
+
+|===
+| Timeformat                 | Presentation 
+
+| yyyy-MM-dd HH:mm:ssZ       | 2019-04-20 13:15:22+0000
+| EEE, d MMM yyyy HH:mm:ss Z | Sat, 20 Apr 2019 13:15:22 +0000    
+| dd.MM.yy HH:mm:ss.SSS      | 20.04.19 13:15:22.285
+|===
+
+See 
+https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html[the Java date/time format documentation]
+for more information about date and time format syntax.
+
 --

 [[ml-find-file-structure-request-body]]
@@ -219,6 +237,9 @@ size, which defaults to 100 Mb.
 [[ml-find-file-structure-examples]]
 ==== {api-examples-title}

+[[ml-find-file-structure-example-nld-json]]
+===== Ingesting newline-delimited JSON
+
 Suppose you have a newline-delimited JSON file that contains information about
 some books. You can send the contents to the `find_file_structure` endpoint:

@@ -527,6 +548,10 @@ If the request does not encounter errors, you receive the following result:
     may provide clues that the data needs to be cleaned or transformed prior
     to use by other {ml} functionality.

+
+[[ml-find-file-structure-example-nyc]]
+===== Finding the structure of NYC yellow cab example data
+
 The next example shows how it's possible to find the structure of some New York
 City yellow cab trip data. The first `curl` command downloads the data, the
 first 20000 lines of which are then piped into the `find_file_structure`
@@ -543,7 +568,7 @@ curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head
 --
 NOTE: The `Content-Type: application/json` header must be set even though in
 this case the data is not JSON. (Alternatively the `Content-Type` can be set
-to any other supported by Elasticsearch, but it must be set.)
+to any other supported by {es}, but it must be set.)
 --

 If the request does not encounter errors, you receive the following result:
@@ -1333,6 +1358,10 @@ If the request does not encounter errors, you receive the following result:
     necessary to supply the timezone they relate to. `need_client_timezone`
     will be `false` for timestamp formats that include the timezone.

+
+[[ml-find-file-structure-example-timeout]]
+===== Setting the timeout parameter
+
 If you try to analyze a lot of data then the analysis will take a long time.
 If you want to limit the amount of processing your {es} cluster performs for
 a request, use the `timeout` query parameter. The analysis will be aborted and
@@ -1375,6 +1404,10 @@ and the timeout is measured from the time this endpoint starts to process the
 data.
 --
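
Reviewer sketch (not part of the patch): to illustrate how query parameters such as `timeout` and `lines_to_sample` attach to `_ml/find_file_structure`, the Python snippet below only assembles a request URL and sends nothing. The host `http://localhost:9200`, the helper name `build_find_file_structure_url`, and the parameter values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def build_find_file_structure_url(base_url, **params):
    """Hypothetical helper: return the endpoint URL with query parameters."""
    # Render Python booleans as lowercase "true"/"false", as used in query strings.
    query = {
        name: (str(value).lower() if isinstance(value, bool) else value)
        for name, value in params.items()
    }
    url = base_url.rstrip("/") + "/_ml/find_file_structure"
    # Append the query string only when at least one parameter was given.
    return url + ("?" + urlencode(query) if query else "")


# Illustrative values: sample 20000 lines and cap the analysis at 10 seconds.
url = build_find_file_structure_url(
    "http://localhost:9200", lines_to_sample=20000, timeout="10s"
)
print(url)
```

The resulting URL could then be passed to `curl` or any HTTP client together with the file contents as the request body and a `Content-Type` header.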

+
+[[ml-find-file-structure-example-eslog]]
+===== Analyzing {es} log files
+
 This is an example of analyzing {es}'s own log file:

 [source,js]
@@ -1523,6 +1556,10 @@ this:
    and recognizable fields that appear in every analyzed message. In this case
    the only field that was recognized beyond the timestamp was the log level.

+
+[[ml-find-file-structure-example-grok]]
+===== Specifying `grok_pattern` as a query parameter
+
 If you recognize more fields than the simple `grok_pattern` produced by the
 structure finder unaided then you can resubmit the request specifying a more
 advanced `grok_pattern` as a query parameter and the structure finder will