[DOCS] Improves find_file_structure documentation (#50743)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
parent 4a7e09f624
commit acd73dda1c
@@ -11,11 +11,13 @@ experimental[]
 Finds the structure of a text file. The text file must contain data that is
 suitable to be ingested into {es}.

+
 [[ml-find-file-structure-request]]
 ==== {api-request-title}

 `POST _ml/find_file_structure`

+
 [[ml-find-file-structure-prereqs]]
 ==== {api-prereq-title}

@@ -23,6 +25,7 @@ suitable to be ingested into {es}.
 `monitor` cluster privileges to use this API. See
 <<security-privileges>>.

+
 [[ml-find-file-structure-desc]]
 ==== {api-description-title}

@@ -55,41 +58,42 @@ specify the `explain` query parameter. It causes an `explanation` to appear in
 the response, which should help in determining why the returned structure was
 chosen.

+
 [[ml-find-file-structure-query-parms]]
 ==== {api-query-parms-title}

 `charset`::
-(string) Optional. The file's character set. It must be a character set that
+(Optional, string) The file's character set. It must be a character set that
 is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
 `windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
 finder chooses an appropriate character set.

 `column_names`::
-(string) Optional. If you have set `format` to `delimited`, you can specify
+(Optional, string) If you have set `format` to `delimited`, you can specify
 the column names in a comma-separated list. If this parameter is not specified,
 the structure finder uses the column names from the header row of the file. If
 the file does not have a header row, columns are named "column1", "column2",
 "column3", etc.

 `delimiter`::
-(string) Optional. If you have set `format` to `delimited`, you can specify
+(Optional, string) If you have set `format` to `delimited`, you can specify
 the character used to delimit the values in each row. Only a single character
 is supported; the delimiter cannot have multiple characters. If this parameter
 is not specified, the structure finder considers the following possibilities:
 comma, tab, semi-colon, and pipe (`|`).

 `explain`::
-(boolean) Optional. If this parameter is set to `true`, the response includes
+(Optional, boolean) If this parameter is set to `true`, the response includes
 a field named `explanation`, which is an array of strings that indicate how
 the structure finder produced its result. The default value is `false`.

 `format`::
-(string) Optional. The high level structure of the file. Valid values are
+(Optional, string) The high level structure of the file. Valid values are
 `ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is
 not specified, the structure finder chooses one.

 `grok_pattern`::
-(string) Optional. If you have set `format` to `semi_structured_text`, you can
+(Optional, string) If you have set `format` to `semi_structured_text`, you can
 specify a Grok pattern that is used to extract fields from every message in
 the file. The name of the timestamp field in the Grok pattern must match what
 is specified in the `timestamp_field` parameter. If that parameter is not
@@ -98,20 +102,20 @@ chosen.
 a Grok pattern.

 `has_header_row`::
-(boolean) Optional. If you have set `format` to `delimited`, you can use this
+(Optional, boolean) If you have set `format` to `delimited`, you can use this
 parameter to indicate whether the column names are in the first row of the
 file. If this parameter is not specified, the structure finder guesses based
 on the similarity of the first row of the file to other rows.

 `line_merge_size_limit`::
-(unsigned integer) Optional. The maximum number of characters in a message
+(Optional, unsigned integer) The maximum number of characters in a message
 when lines are merged to form messages while analyzing semi-structured files.
 The default is `10000`. If you have extremely long messages you may need to
 increase this, but be aware that this may lead to very long processing times
 if the way to group lines into messages is misdetected.

 `lines_to_sample`::
-(unsigned integer) Optional. The number of lines to include in the structural
+(Optional, unsigned integer) The number of lines to include in the structural
 analysis, starting from the beginning of the file. The minimum is 2; the
 default is `1000`. If the value of this parameter is greater than the number
 of lines in the file, the analysis proceeds (as long as there are at least two
@@ -127,7 +131,7 @@ to request analysis of 100000 lines to achieve some variety.
 --

 `quote`::
-(string) Optional. If you have set `format` to `delimited`, you can specify
+(Optional, string) If you have set `format` to `delimited`, you can specify
 the character used to quote the values in each row if they contain newlines or
 the delimiter character. Only a single character is supported. If this
 parameter is not specified, the default value is a double quote (`"`). If your
@@ -135,18 +139,18 @@ to request analysis of 100000 lines to achieve some variety.
 argument to a character that does not appear anywhere in the sample.

 `should_trim_fields`::
-(boolean) Optional. If you have set `format` to `delimited`, you can specify
+(Optional, boolean) If you have set `format` to `delimited`, you can specify
 whether values between delimiters should have whitespace trimmed from them. If
 this parameter is not specified and the delimiter is pipe (`|`), the default
 value is `true`. Otherwise, the default value is `false`.

 `timeout`::
-(time) Optional. Sets the maximum amount of time that the structure analysis
-may take. If the analysis is still running when the timeout expires then it
-will be aborted. The default value is 25 seconds.
+(Optional, <<time-units,time units>>) Sets the maximum amount of time that the
+structure analysis may take. If the analysis is still running when the
+timeout expires then it will be aborted. The default value is 25 seconds.

 `timestamp_field`::
-(string) Optional. The name of the field that contains the primary timestamp
+(Optional, string) The name of the field that contains the primary timestamp
 of each record in the file. In particular, if the file were ingested into an
 index, this is the field that would be used to populate the `@timestamp` field.
 +
@@ -159,16 +163,16 @@ also specified.
 For structured file formats, if you specify this parameter, the field must exist
 within the file.

-If this parameter is not specified, the structure finder makes a decision about which
-field (if any) is the primary timestamp field. For structured file formats, it
-is not compulsory to have a timestamp in the file.
+If this parameter is not specified, the structure finder makes a decision about
+which field (if any) is the primary timestamp field. For structured file
+formats, it is not compulsory to have a timestamp in the file.
 --

 `timestamp_format`::
-(string) Optional. The Java time format of the timestamp field in the file. +
+(Optional, string) The Java time format of the timestamp field in the file.
 +
 --
-NOTE: Only a subset of Java time format letter groups are supported:
+Only a subset of Java time format letter groups are supported:

 * `a`
 * `d`
@@ -206,6 +210,20 @@ structure finder does not consider by default.
 If this parameter is not specified, the structure finder chooses the best
 format from a built-in set.

+The following table provides the appropriate `timeformat` values for some example timestamps:
+
+|===
+| Timeformat | Presentation
+
+| yyyy-MM-dd HH:mm:ssZ | 2019-04-20 13:15:22+0000
+| EEE, d MMM yyyy HH:mm:ss Z | Sat, 20 Apr 2019 13:15:22 +0000
+| dd.MM.yy HH:mm:ss.SSS | 20.04.19 13:15:22.285
+|===
+
+See
+https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html[the Java date/time format documentation]
+for more information about date and time format syntax.
+
 --

 [[ml-find-file-structure-request-body]]
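As a rough illustration of how the query parameters documented in the hunks above combine into a request, the following sketch assembles a URL with Python's standard library. It is not part of the commit: the helper name `build_find_file_structure_url` and the base URL `http://localhost:9200` are assumptions made for this example; only the endpoint path and the parameter names come from the documentation being changed.

```python
from urllib.parse import urlencode


# Hypothetical helper (not part of Elasticsearch or its clients).
# The base URL is an assumption; adjust it for your cluster.
def build_find_file_structure_url(base_url="http://localhost:9200", **params):
    # Render Python booleans as lowercase `true`/`false`, the spelling the
    # documentation uses for boolean parameter values.
    encoded = {
        name: str(value).lower() if isinstance(value, bool) else value
        for name, value in params.items()
        if value is not None
    }
    query = urlencode(encoded)
    url = f"{base_url}/_ml/find_file_structure"
    return f"{url}?{query}" if query else url


# Example: a pipe-delimited file with a header row and a 30 second timeout.
url = build_find_file_structure_url(
    format="delimited",
    delimiter="|",
    has_header_row=True,
    lines_to_sample=20000,
    timeout="30s",
)
print(url)
```

`urlencode` percent-encodes the pipe character, so the single-character `delimiter` value survives being placed in a URL.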
@@ -219,6 +237,9 @@ size, which defaults to 100 Mb.
 [[ml-find-file-structure-examples]]
 ==== {api-examples-title}

+[[ml-find-file-structure-example-nld-json]]
+===== Ingesting newline-delimited JSON
+
 Suppose you have a newline-delimited JSON file that contains information about
 some books. You can send the contents to the `find_file_structure` endpoint:

@@ -527,6 +548,10 @@ If the request does not encounter errors, you receive the following result:
 may provide clues that the data needs to be cleaned or transformed prior
 to use by other {ml} functionality.

+
+[[ml-find-file-structure-example-nyc]]
+===== Finding the structure of NYC yellow cab example data
+
 The next example shows how it's possible to find the structure of some New York
 City yellow cab trip data. The first `curl` command downloads the data, the
 first 20000 lines of which are then piped into the `find_file_structure`
@@ -543,7 +568,7 @@ curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head
 --
 NOTE: The `Content-Type: application/json` header must be set even though in
 this case the data is not JSON. (Alternatively the `Content-Type` can be set
-to any other supported by Elasticsearch, but it must be set.)
+to any other supported by {es}, but it must be set.)
 --

 If the request does not encounter errors, you receive the following result:
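The hunk above is cut off before the response body. Separately from that response, the `Content-Type` requirement in the NOTE can be sketched in Python. This is an illustrative equivalent of the `head`-then-POST pipeline, not code from the commit: the function name `build_structure_request`, the host and port, and the `lines_to_sample=20000` query string are assumptions, and the request is only built, never sent.

```python
import io
from itertools import islice
from urllib.request import Request

# Assumed endpoint URL; adjust the host, port and parameters for your cluster.
URL = "http://localhost:9200/_ml/find_file_structure?lines_to_sample=20000"


def build_structure_request(lines, max_lines=20000, url=URL):
    """Build (but do not send) a POST request from the first `max_lines` lines."""
    # Equivalent of piping the file through `head`: keep only the leading lines.
    body = "".join(islice(lines, max_lines)).encode("utf-8")
    # The Content-Type header must be set even though the body is CSV, not JSON.
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Tiny in-memory stand-in for the downloaded CSV file.
sample = io.StringIO("id,fare\n1,9.5\n2,12.0\n")
request = build_structure_request(sample, max_lines=2)
print(request.get_method(), request.full_url)
print(request.data)
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a cluster.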
@@ -1333,6 +1358,10 @@ If the request does not encounter errors, you receive the following result:
 necessary to supply the timezone they relate to. `need_client_timezone`
 will be `false` for timestamp formats that include the timezone.

+
+[[ml-find-file-structure-example-timeout]]
+===== Setting the timeout parameter
+
 If you try to analyze a lot of data then the analysis will take a long time.
 If you want to limit the amount of processing your {es} cluster performs for
 a request, use the `timeout` query parameter. The analysis will be aborted and
@@ -1375,6 +1404,10 @@ and the timeout is measured from the time this endpoint starts to process the
 data.
 --

+
+[[ml-find-file-structure-example-eslog]]
+===== Analyzing {es} log files
+
 This is an example of analyzing {es}'s own log file:

 [source,js]
@@ -1523,6 +1556,10 @@ this:
 and recognizable fields that appear in every analyzed message. In this case
 the only field that was recognized beyond the timestamp was the log level.

+
+[[ml-find-file-structure-example-grok]]
+===== Specifying `grok_pattern` as query parameter
+
 If you recognize more fields than the simple `grok_pattern` produced by the
 structure finder unaided then you can resubmit the request specifying a more
 advanced `grok_pattern` as a query parameter and the structure finder will