OpenSearch/docs/reference/indices/analyze.asciidoc

[[indices-analyze]]
== Analyze

Performs the analysis process on a text and return the tokens breakdown
of the text.

Can be used without specifying an index against one of the many built in
analyzers:

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

If text parameter is provided as array of strings, it is analyzed as a multi-valued field.

[source,js]
--------------------------------------------------
GET _analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}
--------------------------------------------------
// CONSOLE

Or by building a custom transient analyzer out of tokenizers,
token filters and char filters. Token filters can use the shorter 'filter'
parameter name:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}
--------------------------------------------------
// CONSOLE

deprecated[5.0.0, Use `filter`/`char_filter` instead of `filters`/`char_filters` and `token_filters` has been removed]

Custom tokenizers, token filters, and character filters can be specified in the request body as follows:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE

It can also run against a specific index:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

The above will run an analysis on the "this is a test" text, using the
default index analyzer associated with the `analyze_sample` index. An `analyzer`
can also be provided to use a different analyzer:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Also, the analyzer can be derived based on a field mapping, for example:

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Will cause the analysis to happen based on the analyzer configured in the
mapping for `obj1.field1` (and if not, the default index analyzer).

A `normalizer` can be provided for keyword field with normalizer associated with the `analyze_sample` index.

[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]

Or by building a custom transient normalizer out of token filters and char filters.

[source,js]
--------------------------------------------------
GET _analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
--------------------------------------------------
// CONSOLE

=== Explain Analyze

If you want to get more advanced details, set `explain` to `true` (defaults to `false`). It will output all token attributes for each token.
You can filter token attributes you want to output by setting `attributes` option.

NOTE: The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] <1>
}
--------------------------------------------------
// CONSOLE
<1> Set "keyword" to output "keyword" attribute only

The request returns the following result:

[source,js]
--------------------------------------------------
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false <1>
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false <1>
      } ]
    } ]
  }
}
--------------------------------------------------
// TESTRESPONSE
<1> Output only "keyword" attribute, since specify "attributes" in the request.

[[tokens-limit-settings]]
[float]
== Settings to prevent tokens explosion
Generating excessive amount of tokens may cause a node to run out of memory.
The following setting allows to limit the number of tokens that can be produced:

`index.analyze.max_token_count`::
    The maximum number of tokens that can be produced using `_analyze` API.
    The default value is `10000`. If more than this limit of tokens gets
    generated, an error will be thrown. The `_analyze` endpoint without a specified
    index will always use `10000` value as a limit. This setting allows you to control
    the limit for a specific index:


[source,js]
--------------------------------------------------
PUT analyze_sample?include_type_name=true
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}
--------------------------------------------------
// CONSOLE


[source,js]
--------------------------------------------------
GET analyze_sample/_analyze
{
  "text" : "this is a test"
}
--------------------------------------------------
// CONSOLE
// TEST[setup:analyze_sample]