[[indices-analyze]]
== Analyze

Performs the analysis process on a text and returns the tokens breakdown
of the text.

Can be used without specifying an index against one of the many built-in
analyzers:

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
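
For illustration, the `standard` analyzer splits this text into four tokens. The response looks roughly like the following (a sketch of the response shape, not a tested response):

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "this",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "a",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
--------------------------------------------------
// NOTCONSOLE
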
If the `text` parameter is provided as an array of strings, it is analyzed
as a multi-valued field.

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}
--------------------------------------------------
// CONSOLE

Or by building a custom transient analyzer out of tokenizers,
token filters and char filters. Token filters can use the shorter `filter`
parameter name:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}
--------------------------------------------------
// CONSOLE

deprecated[5.0.0, Use `filter`/`char_filter` instead of `filters`/`char_filters` and `token_filters` has been removed]
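
For comparison with the deprecation note above, a pre-5.0 request using the old parameter names looked roughly like this (a reconstruction for migration reference only; it is rejected on current versions):

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filters" : ["lowercase"], <1>
  "char_filters" : ["html_strip"], <2>
  "text" : "this is a <b>test</b>"
}
--------------------------------------------------
// NOTCONSOLE
<1> Use `filter` instead
<2> Use `char_filter` instead
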
Custom tokenizers, token filters, and character filters can be specified in the request body as follows:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

It can also run against a specific index:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

The above will run an analysis on the "this is a test" text, using the
default index analyzer associated with the `analyze_sample` index. An `analyzer`
can also be provided to use a different analyzer:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Also, the analyzer can be derived based on a field mapping, for example:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

This will cause the analysis to happen based on the analyzer configured in the
mapping for `obj1.field1` (and, if not configured there, the default index analyzer).
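
For the field-based lookup to work, `obj1.field1` needs an analyzer configured in the index mapping. A minimal sketch of such a mapping is shown below (the actual `analyze_sample` setup used by the docs tests may differ; the `whitespace` analyzer here is only an example choice):

[source,js]
--------------------------------------------------
PUT analyze_sample
{
  "mappings" : {
    "properties" : {
      "obj1" : {
        "properties" : {
          "field1" : {
            "type" : "text",
            "analyzer" : "whitespace"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
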
A `normalizer` can be provided for a keyword field with a normalizer
associated with the `analyze_sample` index.
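
The next request assumes a normalizer named `my_normalizer` has already been defined on the index; a minimal sketch of such a definition (the docs' actual test setup may differ):

[source,js]
--------------------------------------------------
PUT analyze_sample
{
  "settings" : {
    "analysis" : {
      "normalizer" : {
        "my_normalizer" : {
          "type" : "custom",
          "filter" : ["lowercase"]
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
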
[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Or by building a custom transient normalizer out of token filters and char filters.

[source,js]
--------------------------------------------------
GET _analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE

[[explain-analyze-api]]
=== Explain Analyze

If you want to get more advanced details, set `explain` to `true` (defaults to `false`). It will output all token attributes for each token.
You can filter the token attributes you want to output by setting the `attributes` option.

NOTE: The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] <1>
}
--------------------------------------------------
// CONSOLE
<1> Set "keyword" to output the "keyword" attribute only

The request returns the following result:

[source,js]
--------------------------------------------------
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false <1>
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false <1>
      } ]
    } ]
  }
}
--------------------------------------------------
// TESTRESPONSE
<1> Output only the "keyword" attribute, since "attributes" was specified in the request

[[tokens-limit-settings]]
[float]
== Settings to prevent token explosion

Generating an excessive amount of tokens may cause a node to run out of memory.
The following setting allows you to limit the number of tokens that can be produced:

`index.analyze.max_token_count`::
    The maximum number of tokens that can be produced using the `_analyze` API.
    The default value is `10000`. If more tokens than this limit are generated,
    an error is thrown. The `_analyze` endpoint without a specified index will
    always use `10000` as the limit. This setting allows you to control the
    limit for a specific index:

[source,js]
--------------------------------------------------
PUT analyze_sample
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}
--------------------------------------------------
// CONSOLE

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]