[DOCS] Improves find_file_structure documentation (#50743)

Co-authored-by: Lisa Cawley <lcawley@elastic.co>
István Zoltán Szabó 2020-01-09 11:19:19 +01:00
parent 4a7e09f624
commit acd73dda1c
1 changed files with 58 additions and 21 deletions


@ -11,11 +11,13 @@ experimental[]
Finds the structure of a text file. The text file must contain data that is Finds the structure of a text file. The text file must contain data that is
suitable to be ingested into {es}. suitable to be ingested into {es}.
[[ml-find-file-structure-request]] [[ml-find-file-structure-request]]
==== {api-request-title} ==== {api-request-title}
`POST _ml/find_file_structure` `POST _ml/find_file_structure`
[[ml-find-file-structure-prereqs]] [[ml-find-file-structure-prereqs]]
==== {api-prereq-title} ==== {api-prereq-title}
@ -23,6 +25,7 @@ suitable to be ingested into {es}.
`monitor` cluster privileges to use this API. See `monitor` cluster privileges to use this API. See
<<security-privileges>>. <<security-privileges>>.
[[ml-find-file-structure-desc]] [[ml-find-file-structure-desc]]
==== {api-description-title} ==== {api-description-title}
@ -55,41 +58,42 @@ specify the `explain` query parameter. It causes an `explanation` to appear in
the response, which should help in determining why the returned structure was the response, which should help in determining why the returned structure was
chosen. chosen.
[[ml-find-file-structure-query-parms]] [[ml-find-file-structure-query-parms]]
==== {api-query-parms-title} ==== {api-query-parms-title}
`charset`:: `charset`::
(string) Optional. The file's character set. It must be a character set that (Optional, string) The file's character set. It must be a character set that
is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`, is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
`windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure `windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
finder chooses an appropriate character set. finder chooses an appropriate character set.
`column_names`:: `column_names`::
(string) Optional. If you have set `format` to `delimited`, you can specify (Optional, string) If you have set `format` to `delimited`, you can specify
the column names in a comma-separated list. If this parameter is not specified, the column names in a comma-separated list. If this parameter is not specified,
the structure finder uses the column names from the header row of the file. If the structure finder uses the column names from the header row of the file. If
the file does not have a header row, columns are named "column1", "column2", the file does not have a header row, columns are named "column1", "column2",
"column3", etc. "column3", etc.
`delimiter`:: `delimiter`::
(string) Optional. If you have set `format` to `delimited`, you can specify (Optional, string) If you have set `format` to `delimited`, you can specify
the character used to delimit the values in each row. Only a single character the character used to delimit the values in each row. Only a single character
is supported; the delimiter cannot have multiple characters. If this parameter is supported; the delimiter cannot have multiple characters. If this parameter
is not specified, the structure finder considers the following possibilities: is not specified, the structure finder considers the following possibilities:
comma, tab, semi-colon, and pipe (`|`). comma, tab, semi-colon, and pipe (`|`).
`explain`:: `explain`::
(boolean) Optional. If this parameter is set to `true`, the response includes (Optional, boolean) If this parameter is set to `true`, the response includes
a field named `explanation`, which is an array of strings that indicate how a field named `explanation`, which is an array of strings that indicate how
the structure finder produced its result. The default value is `false`. the structure finder produced its result. The default value is `false`.
`format`:: `format`::
(string) Optional. The high level structure of the file. Valid values are (Optional, string) The high level structure of the file. Valid values are
`ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is `ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is
not specified, the structure finder chooses one. not specified, the structure finder chooses one.
`grok_pattern`:: `grok_pattern`::
(string) Optional. If you have set `format` to `semi_structured_text`, you can (Optional, string) If you have set `format` to `semi_structured_text`, you can
specify a Grok pattern that is used to extract fields from every message in specify a Grok pattern that is used to extract fields from every message in
the file. The name of the timestamp field in the Grok pattern must match what the file. The name of the timestamp field in the Grok pattern must match what
is specified in the `timestamp_field` parameter. If that parameter is not is specified in the `timestamp_field` parameter. If that parameter is not
@ -98,20 +102,20 @@ chosen.
a Grok pattern. a Grok pattern.
`has_header_row`:: `has_header_row`::
(boolean) Optional. If you have set `format` to `delimited`, you can use this (Optional, boolean) If you have set `format` to `delimited`, you can use this
parameter to indicate whether the column names are in the first row of the parameter to indicate whether the column names are in the first row of the
file. If this parameter is not specified, the structure finder guesses based file. If this parameter is not specified, the structure finder guesses based
on the similarity of the first row of the file to other rows. on the similarity of the first row of the file to other rows.
`line_merge_size_limit`:: `line_merge_size_limit`::
(unsigned integer) Optional. The maximum number of characters in a message (Optional, unsigned integer) The maximum number of characters in a message
when lines are merged to form messages while analyzing semi-structured files. when lines are merged to form messages while analyzing semi-structured files.
The default is `10000`. If you have extremely long messages you may need to The default is `10000`. If you have extremely long messages you may need to
increase this, but be aware that this may lead to very long processing times increase this, but be aware that this may lead to very long processing times
if the way to group lines into messages is misdetected. if the way to group lines into messages is misdetected.
`lines_to_sample`:: `lines_to_sample`::
(unsigned integer) Optional. The number of lines to include in the structural (Optional, unsigned integer) The number of lines to include in the structural
analysis, starting from the beginning of the file. The minimum is 2; the analysis, starting from the beginning of the file. The minimum is 2; the
default is `1000`. If the value of this parameter is greater than the number default is `1000`. If the value of this parameter is greater than the number
of lines in the file, the analysis proceeds (as long as there are at least two of lines in the file, the analysis proceeds (as long as there are at least two
@ -127,7 +131,7 @@ to request analysis of 100000 lines to achieve some variety.
-- --
`quote`:: `quote`::
(string) Optional. If you have set `format` to `delimited`, you can specify (Optional, string) If you have set `format` to `delimited`, you can specify
the character used to quote the values in each row if they contain newlines or the character used to quote the values in each row if they contain newlines or
the delimiter character. Only a single character is supported. If this the delimiter character. Only a single character is supported. If this
parameter is not specified, the default value is a double quote (`"`). If your parameter is not specified, the default value is a double quote (`"`). If your
@ -135,18 +139,18 @@ to request analysis of 100000 lines to achieve some variety.
argument to a character that does not appear anywhere in the sample. argument to a character that does not appear anywhere in the sample.
`should_trim_fields`:: `should_trim_fields`::
(boolean) Optional. If you have set `format` to `delimited`, you can specify (Optional, boolean) If you have set `format` to `delimited`, you can specify
whether values between delimiters should have whitespace trimmed from them. If whether values between delimiters should have whitespace trimmed from them. If
this parameter is not specified and the delimiter is pipe (`|`), the default this parameter is not specified and the delimiter is pipe (`|`), the default
value is `true`. Otherwise, the default value is `false`. value is `true`. Otherwise, the default value is `false`.
`timeout`:: `timeout`::
(time) Optional. Sets the maximum amount of time that the structure analysis (Optional, <<time-units,time units>>) Sets the maximum amount of time that the
may take. If the analysis is still running when the timeout expires then it structure analysis may take. If the analysis is still running when the
will be aborted. The default value is 25 seconds. timeout expires then it will be aborted. The default value is 25 seconds.
`timestamp_field`:: `timestamp_field`::
(string) Optional. The name of the field that contains the primary timestamp (Optional, string) The name of the field that contains the primary timestamp
of each record in the file. In particular, if the file were ingested into an of each record in the file. In particular, if the file were ingested into an
index, this is the field that would be used to populate the `@timestamp` field. index, this is the field that would be used to populate the `@timestamp` field.
+ +
@ -159,16 +163,16 @@ also specified.
For structured file formats, if you specify this parameter, the field must exist For structured file formats, if you specify this parameter, the field must exist
within the file. within the file.
If this parameter is not specified, the structure finder makes a decision about which If this parameter is not specified, the structure finder makes a decision about
field (if any) is the primary timestamp field. For structured file formats, it which field (if any) is the primary timestamp field. For structured file
is not compulsory to have a timestamp in the file. formats, it is not compulsory to have a timestamp in the file.
-- --
`timestamp_format`:: `timestamp_format`::
(string) Optional. The Java time format of the timestamp field in the file. + (Optional, string) The Java time format of the timestamp field in the file.
+ +
-- --
NOTE: Only a subset of Java time format letter groups are supported: Only a subset of Java time format letter groups are supported:
* `a` * `a`
* `d` * `d`
@ -206,6 +210,20 @@ structure finder does not consider by default.
If this parameter is not specified, the structure finder chooses the best If this parameter is not specified, the structure finder chooses the best
format from a built-in set. format from a built-in set.
The following table provides the appropriate `timestamp_format` values for some example timestamps:
|===
| Timeformat | Presentation
| yyyy-MM-dd HH:mm:ssZ | 2019-04-20 13:15:22+0000
| EEE, d MMM yyyy HH:mm:ss Z | Sat, 20 Apr 2019 13:15:22 +0000
| dd.MM.yy HH:mm:ss.SSS | 20.04.19 13:15:22.285
|===
See
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html[the Java date/time format documentation]
for more information about date and time format syntax.
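As a rough cross-check of the table above, the three example timestamps can be parsed outside {es}. The sketch below uses Python `strptime` directives as hand-translated stand-ins for the Java patterns. That mapping is an assumption made for illustration only; the structure finder itself accepts the Java patterns, not these directives.

```python
from datetime import datetime

# Java pattern -> (assumed Python strptime equivalent, example from the table).
# The strptime strings are an illustrative translation, not part of the API.
EXAMPLES = {
    "yyyy-MM-dd HH:mm:ssZ": ("%Y-%m-%d %H:%M:%S%z", "2019-04-20 13:15:22+0000"),
    "EEE, d MMM yyyy HH:mm:ss Z": ("%a, %d %b %Y %H:%M:%S %z", "Sat, 20 Apr 2019 13:15:22 +0000"),
    "dd.MM.yy HH:mm:ss.SSS": ("%d.%m.%y %H:%M:%S.%f", "20.04.19 13:15:22.285"),
}


def parse_examples():
    """Parse each example timestamp with its assumed strptime equivalent."""
    # %a and %b expect English names under Python's default (C) locale.
    return {java: datetime.strptime(text, py) for java, (py, text) in EXAMPLES.items()}


if __name__ == "__main__":
    for java_pattern, parsed in parse_examples().items():
        print(f"{java_pattern!r:32} -> {parsed.isoformat()}")
```

The first two patterns carry a UTC offset, so they parse to timezone-aware values. The third has no offset and parses to a naive value.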
-- --
[[ml-find-file-structure-request-body]] [[ml-find-file-structure-request-body]]
@ -219,6 +237,9 @@ size, which defaults to 100 Mb.
[[ml-find-file-structure-examples]] [[ml-find-file-structure-examples]]
==== {api-examples-title} ==== {api-examples-title}
[[ml-find-file-structure-example-nld-json]]
===== Ingesting newline-delimited JSON
Suppose you have a newline-delimited JSON file that contains information about Suppose you have a newline-delimited JSON file that contains information about
some books. You can send the contents to the `find_file_structure` endpoint: some books. You can send the contents to the `find_file_structure` endpoint:
@ -527,6 +548,10 @@ If the request does not encounter errors, you receive the following result:
may provide clues that the data needs to be cleaned or transformed prior may provide clues that the data needs to be cleaned or transformed prior
to use by other {ml} functionality. to use by other {ml} functionality.
[[ml-find-file-structure-example-nyc]]
===== Finding the structure of NYC yellow cab example data
The next example shows how it's possible to find the structure of some New York The next example shows how it's possible to find the structure of some New York
City yellow cab trip data. The first `curl` command downloads the data, the City yellow cab trip data. The first `curl` command downloads the data, the
first 20000 lines of which are then piped into the `find_file_structure` first 20000 lines of which are then piped into the `find_file_structure`
@ -543,7 +568,7 @@ curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head
-- --
NOTE: The `Content-Type: application/json` header must be set even though in NOTE: The `Content-Type: application/json` header must be set even though in
this case the data is not JSON. (Alternatively the `Content-Type` can be set this case the data is not JSON. (Alternatively the `Content-Type` can be set
to any other supported by Elasticsearch, but it must be set.) to any other supported by {es}, but it must be set.)
-- --
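The header requirement in the note above can be sketched with Python's standard library. The helper name `build_request`, the `http://localhost:9200` address, and the sample rows are assumptions for illustration. The request is only constructed here, not sent.

```python
import urllib.request


def build_request(sample_text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST to the find_file_structure endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/_ml/find_file_structure",
        data=sample_text.encode("utf-8"),
        # Must be set even though the body is CSV rather than JSON.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Made-up rows standing in for the first lines of a real CSV file.
    sample = "id,pickup_time\n1,2018-06-01 00:15:40\n2,2018-06-01 00:04:18\n"
    req = build_request(sample)
    print(req.get_method(), req.full_url, req.get_header("Content-type"))
```

Passing the returned object to `urllib.request.urlopen` would send it to a running cluster at the assumed address.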
If the request does not encounter errors, you receive the following result: If the request does not encounter errors, you receive the following result:
@ -1333,6 +1358,10 @@ If the request does not encounter errors, you receive the following result:
necessary to supply the timezone they relate to. `need_client_timezone` necessary to supply the timezone they relate to. `need_client_timezone`
will be `false` for timestamp formats that include the timezone. will be `false` for timestamp formats that include the timezone.
[[ml-find-file-structure-example-timeout]]
===== Setting the timeout parameter
If you try to analyze a lot of data then the analysis will take a long time. If you try to analyze a lot of data then the analysis will take a long time.
If you want to limit the amount of processing your {es} cluster performs for If you want to limit the amount of processing your {es} cluster performs for
a request, use the `timeout` query parameter. The analysis will be aborted and a request, use the `timeout` query parameter. The analysis will be aborted and
@ -1375,6 +1404,10 @@ and the timeout is measured from the time this endpoint starts to process the
data. data.
-- --
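One way to attach `timeout` and other query parameters to the endpoint URL is sketched below. The helper name `find_file_structure_url`, the `http://localhost:9200` address, and the parameter values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def find_file_structure_url(base_url="http://localhost:9200", **params):
    """Return the endpoint URL with the given query parameters appended."""
    # Booleans are lowercased to match the `true`/`false` spelling used above.
    query = {k: str(v).lower() if isinstance(v, bool) else v for k, v in params.items()}
    path = f"{base_url}/_ml/find_file_structure"
    return f"{path}?{urlencode(query)}" if query else path


if __name__ == "__main__":
    # Arbitrary values: sample many lines but give up after one second.
    print(find_file_structure_url(lines_to_sample=2000000, timeout="1s", explain=True))
```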
[[ml-find-file-structure-example-eslog]]
===== Analyzing {es} log files
This is an example of analyzing {es}'s own log file: This is an example of analyzing {es}'s own log file:
[source,js] [source,js]
@ -1523,6 +1556,10 @@ this:
and recognizable fields that appear in every analyzed message. In this case and recognizable fields that appear in every analyzed message. In this case
the only field that was recognized beyond the timestamp was the log level. the only field that was recognized beyond the timestamp was the log level.
[[ml-find-file-structure-example-grok]]
===== Specifying `grok_pattern` as query parameter
If you recognize more fields than the simple `grok_pattern` produced by the If you recognize more fields than the simple `grok_pattern` produced by the
structure finder unaided then you can resubmit the request specifying a more structure finder unaided then you can resubmit the request specifying a more
advanced `grok_pattern` as a query parameter and the structure finder will advanced `grok_pattern` as a query parameter and the structure finder will