[[indices-analyze]]
=== Analyze API
++++
<titleabbrev>Analyze</titleabbrev>
++++

Performs <<analysis,analysis>> on a text string
and returns the resulting tokens.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : "Quick Brown Foxes!"
}
--------------------------------------------------

[[analyze-api-request]]
==== {api-request-title}

`GET /_analyze`

`POST /_analyze`

`GET /<index>/_analyze`

`POST /<index>/_analyze`

[[analyze-api-path-params]]
==== {api-path-parms-title}

`<index>`::
+
--
(Optional, string)
Index used to derive the analyzer.

If specified,
the `analyzer` or `field` parameter overrides this value.

If neither an analyzer nor a field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--

[[analyze-api-query-params]]
==== {api-query-parms-title}

`analyzer`::
+
--
(Optional, string)
The name of the analyzer to apply to the provided `text`. This can be a
<<analysis-analyzers, built-in analyzer>> or an analyzer configured in the index.

If this parameter is not specified,
the analyze API uses the analyzer defined in the field's mapping.

If no field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--

`attributes`::
(Optional, array of strings)
Array of token attributes used to filter the output of the `explain` parameter.

`char_filter`::
(Optional, array of strings)
Array of character filters used to preprocess characters before the tokenizer.
See <<analysis-charfilters>> for a list of character filters.

`explain`::
(Optional, boolean)
If `true`, the response includes token attributes and additional details.
Defaults to `false`.
experimental:[The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.]

`field`::
+
--
(Optional, string)
Field used to derive the analyzer.
To use this parameter,
you must specify an index.

If specified,
the `analyzer` parameter overrides this value.

If no field is specified,
the analyze API uses the default analyzer for the index.

If no index is specified
or the index does not have a default analyzer,
the analyze API uses the <<analysis-standard-analyzer,standard analyzer>>.
--

`filter`::
(Optional, array of strings)
Array of token filters to apply after the tokenizer.
See <<analysis-tokenfilters>> for a list of token filters.

`normalizer`::
(Optional, string)
Normalizer to use to convert text into a single token.
See <<analysis-normalizers>> for a list of normalizers.

`text`::
(Required, string or array of strings)
Text to analyze.
If an array of strings is provided, it is analyzed as a multi-value field.

`tokenizer`::
(Optional, string)
Tokenizer to use to convert text into tokens.
See <<analysis-tokenizers>> for a list of tokenizers.

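
The fallback order described for the `analyzer` and `field` parameters can be summarized in a short Python sketch. The function `resolve_analyzer` and its argument names are hypothetical and chosen only for illustration. This is not how {es} implements the lookup, just a restatement of the documented precedence.

```python
from typing import Optional


def resolve_analyzer(
    analyzer: Optional[str] = None,        # value of the `analyzer` parameter, if given
    field_analyzer: Optional[str] = None,  # analyzer from the mapping of `field`, if any
    index_default: Optional[str] = None,   # default analyzer of the index, if any
) -> str:
    """Return the analyzer name chosen by the documented fallback order."""
    # 1. an explicit analyzer wins, 2. then the field's mapping,
    # 3. then the index default, 4. otherwise the standard analyzer.
    for candidate in (analyzer, field_analyzer, index_default):
        if candidate is not None:
            return candidate
    return "standard"


print(resolve_analyzer())                                # standard
print(resolve_analyzer(index_default="english"))         # english
print(resolve_analyzer("whitespace", "keyword", "english"))  # whitespace
```

Passing `None` for an argument models the case where that source of an analyzer is not specified or not configured.
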
[[analyze-api-example]]
==== {api-examples-title}

[[analyze-api-no-index-ex]]
===== No index specified

You can apply any of the built-in analyzers to the text string without
specifying an index.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}
--------------------------------------------------

[[analyze-api-text-array-ex]]
===== Array of text strings

If the `text` parameter is provided as an array of strings, it is analyzed as a multi-value field.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}
--------------------------------------------------

[[analyze-api-custom-analyzer-ex]]
===== Custom analyzer

You can use the analyze API to test a custom transient analyzer built from
tokenizers, token filters, and char filters. Token filters use the `filter`
parameter:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}
--------------------------------------------------

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}
--------------------------------------------------

Custom tokenizers, token filters, and character filters can be specified in the request body as follows:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
--------------------------------------------------

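
To make the order of operations in a transient analyzer concrete, the following Python sketch imitates the last request: a whitespace tokenizer followed by a lowercase filter and a stop filter with the same stopwords. All function names are hypothetical, and the sketch only approximates the behavior of these components. It does not call {es} and ignores offsets, positions, and token types.

```python
from typing import Callable, Iterable, List

TokenFilter = Callable[[List[str]], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Split the text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(stopwords: Iterable[str]) -> TokenFilter:
    """Build a filter that drops tokens found in `stopwords`."""
    stop = set(stopwords)

    def apply(tokens: List[str]) -> List[str]:
        return [token for token in tokens if token not in stop]

    return apply


def analyze(text: str, tokenizer: Callable[[str], List[str]],
            filters: List[TokenFilter]) -> List[str]:
    """Run the tokenizer first, then each token filter in the listed order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "this is a test",
    whitespace_tokenizer,
    [lowercase_filter, stop_filter(["a", "is", "this"])],
)
print(tokens)  # ['test']
```

The filters run in the order they are listed, so placing the lowercase filter first lets the stop filter match stopwords regardless of the original casing.
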
[[analyze-api-specific-index-ex]]
===== Specific index

You can also run the analyze API against a specific index:

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// TEST[setup:analyze_sample]

The request above analyzes the text "this is a test" using the
default analyzer of the `analyze_sample` index. To use a different analyzer,
specify the `analyzer` parameter:

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}
--------------------------------------------------
// TEST[setup:analyze_sample]

[[analyze-api-field-ex]]
===== Derive analyzer from a field mapping

The analyzer can be derived from a field mapping, for example:

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}
--------------------------------------------------
// TEST[setup:analyze_sample]

This request analyzes the text using the analyzer configured in the
mapping for `obj1.field1`. If the mapping does not configure an analyzer,
the default index analyzer is used.

[[analyze-api-normalizer-ex]]
===== Normalizer

For keyword fields, you can provide a `normalizer` associated with the `analyze_sample` index:

[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}
--------------------------------------------------
// TEST[setup:analyze_sample]

You can also build a custom transient normalizer out of token filters and char filters:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
--------------------------------------------------

[[explain-analyze-api]]
===== Explain analyze

To get more detailed output, set `explain` to `true` (defaults to `false`). The response then includes all token attributes for each token.
To limit the output to specific token attributes, set the `attributes` option.

NOTE: The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] <1>
}
--------------------------------------------------

<1> Set "keyword" to output only the "keyword" attribute.

The request returns the following result:

[source,console-result]
--------------------------------------------------
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false <1>
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false <1>
      } ]
    } ]
  }
}
--------------------------------------------------

<1> Only the "keyword" attribute is output because "attributes" is specified in the request.

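
When inspecting an `explain` response, it can help to see which tokens each stage produced. The Python sketch below walks a response with the same shape as the result shown above, using a trimmed copy of it as sample data. The helper name `tokens_by_stage` is hypothetical, and the sketch assumes the `detail.tokenizer` and `detail.tokenfilters` keys that appear in that result. Because the format is experimental, other requests or versions may return a different structure.

```python
from typing import Dict, List


def tokens_by_stage(response: dict) -> Dict[str, List[str]]:
    """Map each analysis stage to the token strings it produced."""
    detail = response["detail"]
    stages: Dict[str, List[str]] = {}

    # Output of the tokenizer stage.
    tokenizer = detail["tokenizer"]
    stages["tokenizer:" + tokenizer["name"]] = [
        token["token"] for token in tokenizer["tokens"]
    ]

    # Output after each token filter, in the order the filters were applied.
    for token_filter in detail.get("tokenfilters", []):
        stages["filter:" + token_filter["name"]] = [
            token["token"] for token in token_filter["tokens"]
        ]
    return stages


# Trimmed copy of the response above: only the fields this sketch reads.
sample = {
    "detail": {
        "custom_analyzer": True,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [{"token": "detailed"}, {"token": "output"}],
        },
        "tokenfilters": [
            {
                "name": "snowball",
                "tokens": [
                    {"token": "detail", "keyword": False},
                    {"token": "output", "keyword": False},
                ],
            }
        ],
    }
}

stages = tokens_by_stage(sample)
for stage, tokens in stages.items():
    print(stage, tokens)
```

Comparing the lists for consecutive stages shows where a token changed, here `detailed` becoming `detail` after the `snowball` filter.
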
[[tokens-limit-settings]]
===== Setting a token limit

Generating an excessive number of tokens may cause a node to run out of memory.
The following setting limits the number of tokens that can be produced:

`index.analyze.max_token_count`::
The maximum number of tokens that can be produced using the `_analyze` API.
The default value is `10000`. If more tokens than this limit are
generated, an error is thrown. The `_analyze` endpoint without a specified
index always uses `10000` as the limit. This setting lets you control
the limit for a specific index:


[source,console]
--------------------------------------------------
PUT /analyze_sample
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}
--------------------------------------------------


[source,console]
--------------------------------------------------
GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// TEST[setup:analyze_sample]
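
The effect of the limit can be pictured with a small Python sketch. The function `enforce_token_limit` is hypothetical and only mirrors the rule stated above, that producing more tokens than the limit results in an error. It is not {es} code, and the error type and message are assumptions made for the example.

```python
from typing import List

DEFAULT_MAX_TOKEN_COUNT = 10000  # documented default for index.analyze.max_token_count


def enforce_token_limit(tokens: List[str],
                        max_token_count: int = DEFAULT_MAX_TOKEN_COUNT) -> List[str]:
    """Return the tokens unchanged, or raise if there are more than the limit."""
    if len(tokens) > max_token_count:
        raise ValueError(
            f"produced {len(tokens)} tokens, more than the limit of {max_token_count}"
        )
    return tokens


small = enforce_token_limit("this is a test".split())
print(len(small))  # 4

try:
    enforce_token_limit(["t"] * 15000)  # exceeds the default limit
except ValueError as error:
    print("rejected:", error)

# With the limit raised to 20000, as in the index settings above, this passes.
large = enforce_token_limit(["t"] * 15000, max_token_count=20000)
print(len(large))  # 15000
```
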