From acd73dda1c5c6462eb6ffb56c3cc44b2a1d72e17 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?= Date: Thu, 9 Jan 2020 11:19:19 +0100 Subject: [PATCH] [DOCS] Improves find_file_structure documentation (#50743) Co-authored-by: Lisa Cawley --- .../apis/find-file-structure.asciidoc | 79 ++++++++++++++----- 1 file changed, 58 insertions(+), 21 deletions(-) diff --git a/docs/reference/ml/anomaly-detection/apis/find-file-structure.asciidoc b/docs/reference/ml/anomaly-detection/apis/find-file-structure.asciidoc index 52afbab9c70..4f85e39d60a 100644 --- a/docs/reference/ml/anomaly-detection/apis/find-file-structure.asciidoc +++ b/docs/reference/ml/anomaly-detection/apis/find-file-structure.asciidoc @@ -11,11 +11,13 @@ experimental[] Finds the structure of a text file. The text file must contain data that is suitable to be ingested into {es}. + [[ml-find-file-structure-request]] ==== {api-request-title} `POST _ml/find_file_structure` + [[ml-find-file-structure-prereqs]] ==== {api-prereq-title} @@ -23,6 +25,7 @@ suitable to be ingested into {es}. `monitor` cluster privileges to use this API. See <>. + [[ml-find-file-structure-desc]] ==== {api-description-title} @@ -55,41 +58,42 @@ specify the `explain` query parameter. It causes an `explanation` to appear in the response, which should help in determining why the returned structure was chosen. + [[ml-find-file-structure-query-parms]] ==== {api-query-parms-title} `charset`:: - (string) Optional. The file's character set. It must be a character set that + (Optional, string) The file's character set. It must be a character set that is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`, `windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure finder chooses an appropriate character set. `column_names`:: - (string) Optional. 
If you have set `format` to `delimited`, you can specify + (Optional, string) If you have set `format` to `delimited`, you can specify the column names in a comma-separated list. If this parameter is not specified, the structure finder uses the column names from the header row of the file. If the file does not have a header row, columns are named "column1", "column2", "column3", etc. `delimiter`:: - (string) Optional. If you have set `format` to `delimited`, you can specify + (Optional, string) If you have set `format` to `delimited`, you can specify the character used to delimit the values in each row. Only a single character is supported; the delimiter cannot have multiple characters. If this parameter is not specified, the structure finder considers the following possibilities: comma, tab, semi-colon, and pipe (`|`). `explain`:: - (boolean) Optional. If this parameter is set to `true`, the response includes + (Optional, boolean) If this parameter is set to `true`, the response includes a field named `explanation`, which is an array of strings that indicate how the structure finder produced its result. The default value is `false`. `format`:: - (string) Optional. The high level structure of the file. Valid values are + (Optional, string) The high level structure of the file. Valid values are `ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is not specified, the structure finder chooses one. `grok_pattern`:: - (string) Optional. If you have set `format` to `semi_structured_text`, you can + (Optional, string) If you have set `format` to `semi_structured_text`, you can specify a Grok pattern that is used to extract fields from every message in the file. The name of the timestamp field in the Grok pattern must match what is specified in the `timestamp_field` parameter. If that parameter is not @@ -98,20 +102,20 @@ chosen. a Grok pattern. `has_header_row`:: - (boolean) Optional. 
If you have set `format` to `delimited`, you can use this + (Optional, boolean) If you have set `format` to `delimited`, you can use this parameter to indicate whether the column names are in the first row of the file. If this parameter is not specified, the structure finder guesses based on the similarity of the first row of the file to other rows. `line_merge_size_limit`:: - (unsigned integer) Optional. The maximum number of characters in a message + (Optional, unsigned integer) The maximum number of characters in a message when lines are merged to form messages while analyzing semi-structured files. The default is `10000`. If you have extremely long messages you may need to increase this, but be aware that this may lead to very long processing times if the way to group lines into messages is misdetected. `lines_to_sample`:: - (unsigned integer) Optional. The number of lines to include in the structural + (Optional, unsigned integer) The number of lines to include in the structural analysis, starting from the beginning of the file. The minimum is 2; the default is `1000`. If the value of this parameter is greater than the number of lines in the file, the analysis proceeds (as long as there are at least two @@ -127,7 +131,7 @@ to request analysis of 100000 lines to achieve some variety. -- `quote`:: - (string) Optional. If you have set `format` to `delimited`, you can specify + (Optional, string) If you have set `format` to `delimited`, you can specify the character used to quote the values in each row if they contain newlines or the delimiter character. Only a single character is supported. If this parameter is not specified, the default value is a double quote (`"`). If your @@ -135,18 +139,18 @@ to request analysis of 100000 lines to achieve some variety. argument to a character that does not appear anywhere in the sample. `should_trim_fields`:: - (boolean) Optional. 
If you have set `format` to `delimited`, you can specify + (Optional, boolean) If you have set `format` to `delimited`, you can specify whether values between delimiters should have whitespace trimmed from them. If this parameter is not specified and the delimiter is pipe (`|`), the default value is `true`. Otherwise, the default value is `false`. `timeout`:: - (time) Optional. Sets the maximum amount of time that the structure analysis - may take. If the analysis is still running when the timeout expires then it - will be aborted. The default value is 25 seconds. + (Optional, <>) Sets the maximum amount of time that the + structure analysis may take. If the analysis is still running when the + timeout expires then it will be aborted. The default value is 25 seconds. `timestamp_field`:: - (string) Optional. The name of the field that contains the primary timestamp + (Optional, string) The name of the field that contains the primary timestamp of each record in the file. In particular, if the file were ingested into an index, this is the field that would be used to populate the `@timestamp` field. + @@ -159,16 +163,16 @@ also specified. For structured file formats, if you specify this parameter, the field must exist within the file. -If this parameter is not specified, the structure finder makes a decision about which -field (if any) is the primary timestamp field. For structured file formats, it -is not compulsory to have a timestamp in the file. +If this parameter is not specified, the structure finder makes a decision about +which field (if any) is the primary timestamp field. For structured file +formats, it is not compulsory to have a timestamp in the file. -- `timestamp_format`:: - (string) Optional. The Java time format of the timestamp field in the file. + + (Optional, string) The Java time format of the timestamp field in the file. 
+ + -- -NOTE: Only a subset of Java time format letter groups are supported: +Only a subset of Java time format letter groups are supported: * `a` * `d` @@ -206,6 +210,20 @@ structure finder does not consider by default. If this parameter is not specified, the structure finder chooses the best format from a built-in set. +The following table provides the appropriate `timestamp_format` values for some example timestamps: + +|=== +| Timeformat | Presentation + +| yyyy-MM-dd HH:mm:ssZ | 2019-04-20 13:15:22+0000 +| EEE, d MMM yyyy HH:mm:ss Z | Sat, 20 Apr 2019 13:15:22 +0000 +| dd.MM.yy HH:mm:ss.SSS | 20.04.19 13:15:22.285 +|=== + +See +https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html[the Java date/time format documentation] +for more information about date and time format syntax. + -- [[ml-find-file-structure-request-body]] @@ -219,6 +237,9 @@ size, which defaults to 100 MB. [[ml-find-file-structure-examples]] ==== {api-examples-title} +[[ml-find-file-structure-example-nld-json]] +===== Ingesting newline-delimited JSON + Suppose you have a newline-delimited JSON file that contains information about some books. You can send the contents to the `find_file_structure` endpoint: @@ -527,6 +548,10 @@ If the request does not encounter errors, you receive the following result: may provide clues that the data needs to be cleaned or transformed prior to use by other {ml} functionality. + +[[ml-find-file-structure-example-nyc]] +===== Finding the structure of NYC yellow cab example data + The next example shows how it's possible to find the structure of some New York City yellow cab trip data. The first `curl` command downloads the data, the first 20000 lines of which are then piped into the `find_file_structure` @@ -543,7 +568,7 @@ curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head -- NOTE: The `Content-Type: application/json` header must be set even though in this case the data is not JSON. 
(Alternatively the `Content-Type` can be set -to any other supported by Elasticsearch, but it must be set.) +to any other supported by {es}, but it must be set.) -- If the request does not encounter errors, you receive the following result: @@ -1333,6 +1358,10 @@ If the request does not encounter errors, you receive the following result: necessary to supply the timezone they relate to. `need_client_timezone` will be `false` for timestamp formats that include the timezone. + +[[ml-find-file-structure-example-timeout]] +===== Setting the timeout parameter + If you try to analyze a lot of data then the analysis will take a long time. If you want to limit the amount of processing your {es} cluster performs for a request, use the `timeout` query parameter. The analysis will be aborted and @@ -1375,6 +1404,10 @@ and the timeout is measured from the time this endpoint starts to process the data. -- + +[[ml-find-file-structure-example-eslog]] +===== Analyzing {es} log files + This is an example of analyzing {es}'s own log file: [source,js] @@ -1523,6 +1556,10 @@ this: and recognizable fields that appear in every analyzed message. In this case the only field that was recognized beyond the timestamp was the log level. + +[[ml-find-file-structure-example-grok]] +===== Specifying `grok_pattern` as query parameter + If you recognize more fields than the simple `grok_pattern` produced by the structure finder unaided then you can resubmit the request specifying a more advanced `grok_pattern` as a query parameter and the structure finder will