[DOCS] Improves find_file_structure documentation (#50743)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
parent 4a7e09f624
commit acd73dda1c
|
@ -11,11 +11,13 @@ experimental[]
|
||||||
Finds the structure of a text file. The text file must contain data that is
|
Finds the structure of a text file. The text file must contain data that is
|
||||||
suitable to be ingested into {es}.
|
suitable to be ingested into {es}.
|
||||||
|
|
||||||
|
|
||||||
[[ml-find-file-structure-request]]
|
[[ml-find-file-structure-request]]
|
||||||
==== {api-request-title}
|
==== {api-request-title}
|
||||||
|
|
||||||
`POST _ml/find_file_structure`
|
`POST _ml/find_file_structure`
|
||||||
|
|
||||||
|
|
||||||
[[ml-find-file-structure-prereqs]]
|
[[ml-find-file-structure-prereqs]]
|
||||||
==== {api-prereq-title}
|
==== {api-prereq-title}
|
||||||
|
|
||||||
|
@ -23,6 +25,7 @@ suitable to be ingested into {es}.
|
||||||
`monitor` cluster privileges to use this API. See
|
`monitor` cluster privileges to use this API. See
|
||||||
<<security-privileges>>.
|
<<security-privileges>>.
|
||||||
|
|
||||||
|
|
||||||
[[ml-find-file-structure-desc]]
|
[[ml-find-file-structure-desc]]
|
||||||
==== {api-description-title}
|
==== {api-description-title}
|
||||||
|
|
||||||
|
@ -55,41 +58,42 @@ specify the `explain` query parameter. It causes an `explanation` to appear in
|
||||||
the response, which should help in determining why the returned structure was
|
the response, which should help in determining why the returned structure was
|
||||||
chosen.
|
chosen.
|
||||||
|
|
||||||
|
|
||||||
[[ml-find-file-structure-query-parms]]
|
[[ml-find-file-structure-query-parms]]
|
||||||
==== {api-query-parms-title}
|
==== {api-query-parms-title}
|
||||||
|
|
||||||
`charset`::
|
`charset`::
|
||||||
(string) Optional. The file's character set. It must be a character set that
|
(Optional, string) The file's character set. It must be a character set that
|
||||||
is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
|
is supported by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`,
|
||||||
`windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
|
`windows-1252`, or `EUC-JP`. If this parameter is not specified, the structure
|
||||||
finder chooses an appropriate character set.
|
finder chooses an appropriate character set.
|
||||||
|
|
||||||
`column_names`::
|
`column_names`::
|
||||||
(string) Optional. If you have set `format` to `delimited`, you can specify
|
(Optional, string) If you have set `format` to `delimited`, you can specify
|
||||||
the column names in a comma-separated list. If this parameter is not specified,
|
the column names in a comma-separated list. If this parameter is not specified,
|
||||||
the structure finder uses the column names from the header row of the file. If
|
the structure finder uses the column names from the header row of the file. If
|
||||||
the file does not have a header row, columns are named "column1", "column2",
|
the file does not have a header row, columns are named "column1", "column2",
|
||||||
"column3", etc.
|
"column3", etc.
|
||||||
|
|
||||||
`delimiter`::
|
`delimiter`::
|
||||||
(string) Optional. If you have set `format` to `delimited`, you can specify
|
(Optional, string) If you have set `format` to `delimited`, you can specify
|
||||||
the character used to delimit the values in each row. Only a single character
|
the character used to delimit the values in each row. Only a single character
|
||||||
is supported; the delimiter cannot have multiple characters. If this parameter
|
is supported; the delimiter cannot have multiple characters. If this parameter
|
||||||
is not specified, the structure finder considers the following possibilities:
|
is not specified, the structure finder considers the following possibilities:
|
||||||
comma, tab, semi-colon, and pipe (`|`).
|
comma, tab, semi-colon, and pipe (`|`).
|
||||||
|
|
||||||
`explain`::
|
`explain`::
|
||||||
(boolean) Optional. If this parameter is set to `true`, the response includes
|
(Optional, boolean) If this parameter is set to `true`, the response includes
|
||||||
a field named `explanation`, which is an array of strings that indicate how
|
a field named `explanation`, which is an array of strings that indicate how
|
||||||
the structure finder produced its result. The default value is `false`.
|
the structure finder produced its result. The default value is `false`.
|
||||||
|
|
||||||
`format`::
|
`format`::
|
||||||
(string) Optional. The high level structure of the file. Valid values are
|
(Optional, string) The high level structure of the file. Valid values are
|
||||||
`ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is
|
`ndjson`, `xml`, `delimited`, and `semi_structured_text`. If this parameter is
|
||||||
not specified, the structure finder chooses one.
|
not specified, the structure finder chooses one.
|
||||||
|
|
||||||
`grok_pattern`::
|
`grok_pattern`::
|
||||||
(string) Optional. If you have set `format` to `semi_structured_text`, you can
|
(Optional, string) If you have set `format` to `semi_structured_text`, you can
|
||||||
specify a Grok pattern that is used to extract fields from every message in
|
specify a Grok pattern that is used to extract fields from every message in
|
||||||
the file. The name of the timestamp field in the Grok pattern must match what
|
the file. The name of the timestamp field in the Grok pattern must match what
|
||||||
is specified in the `timestamp_field` parameter. If that parameter is not
|
is specified in the `timestamp_field` parameter. If that parameter is not
|
||||||
|
@ -98,20 +102,20 @@ chosen.
|
||||||
a Grok pattern.
|
a Grok pattern.
|
||||||
|
|
||||||
`has_header_row`::
|
`has_header_row`::
|
||||||
(boolean) Optional. If you have set `format` to `delimited`, you can use this
|
(Optional, boolean) If you have set `format` to `delimited`, you can use this
|
||||||
parameter to indicate whether the column names are in the first row of the
|
parameter to indicate whether the column names are in the first row of the
|
||||||
file. If this parameter is not specified, the structure finder guesses based
|
file. If this parameter is not specified, the structure finder guesses based
|
||||||
on the similarity of the first row of the file to other rows.
|
on the similarity of the first row of the file to other rows.
|
||||||
|
|
||||||
`line_merge_size_limit`::
|
`line_merge_size_limit`::
|
||||||
(unsigned integer) Optional. The maximum number of characters in a message
|
(Optional, unsigned integer) The maximum number of characters in a message
|
||||||
when lines are merged to form messages while analyzing semi-structured files.
|
when lines are merged to form messages while analyzing semi-structured files.
|
||||||
The default is `10000`. If you have extremely long messages you may need to
|
The default is `10000`. If you have extremely long messages you may need to
|
||||||
increase this, but be aware that this may lead to very long processing times
|
increase this, but be aware that this may lead to very long processing times
|
||||||
if the way to group lines into messages is misdetected.
|
if the way to group lines into messages is misdetected.
|
||||||
|
|
||||||
`lines_to_sample`::
|
`lines_to_sample`::
|
||||||
(unsigned integer) Optional. The number of lines to include in the structural
|
(Optional, unsigned integer) The number of lines to include in the structural
|
||||||
analysis, starting from the beginning of the file. The minimum is 2; the
|
analysis, starting from the beginning of the file. The minimum is 2; the
|
||||||
default is `1000`. If the value of this parameter is greater than the number
|
default is `1000`. If the value of this parameter is greater than the number
|
||||||
of lines in the file, the analysis proceeds (as long as there are at least two
|
of lines in the file, the analysis proceeds (as long as there are at least two
|
||||||
|
@ -127,7 +131,7 @@ to request analysis of 100000 lines to achieve some variety.
|
||||||
--
|
--
|
||||||
|
|
||||||
`quote`::
|
`quote`::
|
||||||
(string) Optional. If you have set `format` to `delimited`, you can specify
|
(Optional, string) If you have set `format` to `delimited`, you can specify
|
||||||
the character used to quote the values in each row if they contain newlines or
|
the character used to quote the values in each row if they contain newlines or
|
||||||
the delimiter character. Only a single character is supported. If this
|
the delimiter character. Only a single character is supported. If this
|
||||||
parameter is not specified, the default value is a double quote (`"`). If your
|
parameter is not specified, the default value is a double quote (`"`). If your
|
||||||
|
@ -135,18 +139,18 @@ to request analysis of 100000 lines to achieve some variety.
|
||||||
argument to a character that does not appear anywhere in the sample.
|
argument to a character that does not appear anywhere in the sample.
|
||||||
|
|
||||||
`should_trim_fields`::
|
`should_trim_fields`::
|
||||||
(boolean) Optional. If you have set `format` to `delimited`, you can specify
|
(Optional, boolean) If you have set `format` to `delimited`, you can specify
|
||||||
whether values between delimiters should have whitespace trimmed from them. If
|
whether values between delimiters should have whitespace trimmed from them. If
|
||||||
this parameter is not specified and the delimiter is pipe (`|`), the default
|
this parameter is not specified and the delimiter is pipe (`|`), the default
|
||||||
value is `true`. Otherwise, the default value is `false`.
|
value is `true`. Otherwise, the default value is `false`.
|
||||||
|
|
||||||
`timeout`::
|
`timeout`::
|
||||||
(time) Optional. Sets the maximum amount of time that the structure analysis
|
(Optional, <<time-units,time units>>) Sets the maximum amount of time that the
|
||||||
may take. If the analysis is still running when the timeout expires then it
|
structure analysis may take. If the analysis is still running when the
|
||||||
will be aborted. The default value is 25 seconds.
|
timeout expires then it will be aborted. The default value is 25 seconds.
|
||||||
|
|
||||||
`timestamp_field`::
|
`timestamp_field`::
|
||||||
(string) Optional. The name of the field that contains the primary timestamp
|
(Optional, string) The name of the field that contains the primary timestamp
|
||||||
of each record in the file. In particular, if the file were ingested into an
|
of each record in the file. In particular, if the file were ingested into an
|
||||||
index, this is the field that would be used to populate the `@timestamp` field.
|
index, this is the field that would be used to populate the `@timestamp` field.
|
||||||
+
|
+
|
||||||
|
@ -159,16 +163,16 @@ also specified.
|
||||||
For structured file formats, if you specify this parameter, the field must exist
|
For structured file formats, if you specify this parameter, the field must exist
|
||||||
within the file.
|
within the file.
|
||||||
|
|
||||||
If this parameter is not specified, the structure finder makes a decision about which
|
If this parameter is not specified, the structure finder makes a decision about
|
||||||
field (if any) is the primary timestamp field. For structured file formats, it
|
which field (if any) is the primary timestamp field. For structured file
|
||||||
is not compulsory to have a timestamp in the file.
|
formats, it is not compulsory to have a timestamp in the file.
|
||||||
--
|
--
|
||||||
|
|
||||||
`timestamp_format`::
|
`timestamp_format`::
|
||||||
(string) Optional. The Java time format of the timestamp field in the file. +
|
(Optional, string) The Java time format of the timestamp field in the file.
|
||||||
+
|
+
|
||||||
--
|
--
|
||||||
NOTE: Only a subset of Java time format letter groups are supported:
|
Only a subset of Java time format letter groups are supported:
|
||||||
|
|
||||||
* `a`
|
* `a`
|
||||||
* `d`
|
* `d`
|
||||||
|
@ -206,6 +210,20 @@ structure finder does not consider by default.
|
||||||
If this parameter is not specified, the structure finder chooses the best
|
If this parameter is not specified, the structure finder chooses the best
|
||||||
format from a built-in set.
|
format from a built-in set.
|
||||||
|
|
||||||
|
The following table provides the appropriate `timestamp_format` values for some example timestamps:
|
||||||
|
|
||||||
|
|===
|
||||||
|
| Timeformat | Presentation
|
||||||
|
|
||||||
|
| yyyy-MM-dd HH:mm:ssZ | 2019-04-20 13:15:22+0000
|
||||||
|
| EEE, d MMM yyyy HH:mm:ss Z | Sat, 20 Apr 2019 13:15:22 +0000
|
||||||
|
| dd.MM.yy HH:mm:ss.SSS | 20.04.19 13:15:22.285
|
||||||
|
|===
|
||||||
|
|
||||||
|
See
|
||||||
|
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html[the Java date/time format documentation]
|
||||||
|
for more information about date and time format syntax.
|
||||||
|
|
||||||
--
|
--
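The query parameters described above are passed on the request URL, so values such as a `timestamp_format` containing spaces and colons need to be percent-encoded. The following is a minimal Python sketch of one way to assemble such a URL. It is not part of the documented API: the helper name `build_url`, the `http://localhost:9200` address, and the column names are hypothetical and chosen only for illustration.

```python
from urllib.parse import quote, urlencode

# Endpoint path as given in the request section of the documentation.
ENDPOINT = "_ml/find_file_structure"


def build_url(base_url, **params):
    """Return the endpoint URL with the given query parameters appended.

    Hypothetical helper: booleans become lowercase "true"/"false" and
    lists become comma-separated strings (as used by `column_names`).
    """
    encoded = {}
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)
        encoded[name] = value
    url = f"{base_url.rstrip('/')}/{ENDPOINT}"
    if not encoded:
        return url
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{url}?{urlencode(encoded, quote_via=quote)}"


# Assumed local cluster address and made-up column names.
url = build_url(
    "http://localhost:9200",
    format="delimited",
    column_names=["pickup", "dropoff", "fare"],
    has_header_row=False,
    timestamp_field="pickup",
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)
print(url)
```

Parameters that are omitted are simply left off the URL, in which case the structure finder falls back to the defaults or guesses described for each parameter.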
|
||||||
|
|
||||||
[[ml-find-file-structure-request-body]]
|
[[ml-find-file-structure-request-body]]
|
||||||
|
@ -219,6 +237,9 @@ size, which defaults to 100 Mb.
|
||||||
[[ml-find-file-structure-examples]]
|
[[ml-find-file-structure-examples]]
|
||||||
==== {api-examples-title}
|
==== {api-examples-title}
|
||||||
|
|
||||||
|
[[ml-find-file-structure-example-nld-json]]
|
||||||
|
===== Ingesting newline-delimited JSON
|
||||||
|
|
||||||
Suppose you have a newline-delimited JSON file that contains information about
|
Suppose you have a newline-delimited JSON file that contains information about
|
||||||
some books. You can send the contents to the `find_file_structure` endpoint:
|
some books. You can send the contents to the `find_file_structure` endpoint:
|
||||||
|
|
||||||
|
@ -527,6 +548,10 @@ If the request does not encounter errors, you receive the following result:
|
||||||
may provide clues that the data needs to be cleaned or transformed prior
|
may provide clues that the data needs to be cleaned or transformed prior
|
||||||
to use by other {ml} functionality.
|
to use by other {ml} functionality.
|
||||||
|
|
||||||
|
|
||||||
|
[[ml-find-file-structure-example-nyc]]
|
||||||
|
===== Finding the structure of NYC yellow cab example data
|
||||||
|
|
||||||
The next example shows how it's possible to find the structure of some New York
|
The next example shows how it's possible to find the structure of some New York
|
||||||
City yellow cab trip data. The first `curl` command downloads the data, the
|
City yellow cab trip data. The first `curl` command downloads the data, the
|
||||||
first 20000 lines of which are then piped into the `find_file_structure`
|
first 20000 lines of which are then piped into the `find_file_structure`
|
||||||
|
@ -543,7 +568,7 @@ curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head
|
||||||
--
|
--
|
||||||
NOTE: The `Content-Type: application/json` header must be set even though in
|
NOTE: The `Content-Type: application/json` header must be set even though in
|
||||||
this case the data is not JSON. (Alternatively the `Content-Type` can be set
|
this case the data is not JSON. (Alternatively the `Content-Type` can be set
|
||||||
to any other supported by Elasticsearch, but it must be set.)
|
to any other supported by {es}, but it must be set.)
|
||||||
--
|
--
|
||||||
|
|
||||||
If the request does not encounter errors, you receive the following result:
|
If the request does not encounter errors, you receive the following result:
|
||||||
|
@ -1333,6 +1358,10 @@ If the request does not encounter errors, you receive the following result:
|
||||||
necessary to supply the timezone they relate to. `need_client_timezone`
|
necessary to supply the timezone they relate to. `need_client_timezone`
|
||||||
will be `false` for timestamp formats that include the timezone.
|
will be `false` for timestamp formats that include the timezone.
|
||||||
|
|
||||||
|
|
||||||
|
[[ml-find-file-structure-example-timeout]]
|
||||||
|
===== Setting the timeout parameter
|
||||||
|
|
||||||
If you try to analyze a lot of data then the analysis will take a long time.
|
If you try to analyze a lot of data then the analysis will take a long time.
|
||||||
If you want to limit the amount of processing your {es} cluster performs for
|
If you want to limit the amount of processing your {es} cluster performs for
|
||||||
a request, use the `timeout` query parameter. The analysis will be aborted and
|
a request, use the `timeout` query parameter. The analysis will be aborted and
|
||||||
|
@ -1375,6 +1404,10 @@ and the timeout is measured from the time this endpoint starts to process the
|
||||||
data.
|
data.
|
||||||
--
|
--
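As a rough illustration of the `timeout` parameter combined with the required `Content-Type` header, the Python sketch below builds (but does not send) a request. The helper name `make_request`, the `http://localhost:9200` address, and the sample bytes are assumptions made for this example only.

```python
from urllib.parse import urlencode
from urllib.request import Request


def make_request(sample, timeout="25s", base_url="http://localhost:9200"):
    """Build a POST request for the structure finder (hypothetical helper).

    `sample` is the raw file content as bytes. `timeout` uses time-unit
    syntax; "25s" mirrors the documented 25-second default.
    """
    query = urlencode({"timeout": timeout})
    return Request(
        f"{base_url}/_ml/find_file_structure?{query}",
        data=sample,
        # The header must be set even when the body is not JSON.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up two-line delimited sample with a shorter, 10-second timeout.
req = make_request(b"a,b\n1,2\n", timeout="10s")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send the request; it is omitted here
# so that the sketch runs without a cluster.
```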
|
||||||
|
|
||||||
|
|
||||||
|
[[ml-find-file-structure-example-eslog]]
|
||||||
|
===== Analyzing {es} log files
|
||||||
|
|
||||||
This is an example of analyzing {es}'s own log file:
|
This is an example of analyzing {es}'s own log file:
|
||||||
|
|
||||||
[source,js]
|
[source,js]
|
||||||
|
@ -1523,6 +1556,10 @@ this:
|
||||||
and recognizable fields that appear in every analyzed message. In this case
|
and recognizable fields that appear in every analyzed message. In this case
|
||||||
the only field that was recognized beyond the timestamp was the log level.
|
the only field that was recognized beyond the timestamp was the log level.
|
||||||
|
|
||||||
|
|
||||||
|
[[ml-find-file-structure-example-grok]]
|
||||||
|
===== Specifying `grok_pattern` as query parameter
|
||||||
|
|
||||||
If you recognize more fields than the simple `grok_pattern` produced by the
|
If you recognize more fields than the simple `grok_pattern` produced by the
|
||||||
structure finder unaided then you can resubmit the request specifying a more
|
structure finder unaided then you can resubmit the request specifying a more
|
||||||
advanced `grok_pattern` as a query parameter and the structure finder will
|
advanced `grok_pattern` as a query parameter and the structure finder will
|
||||||
|
|