[[test-analyzer]]
=== Test an analyzer

The <<indices-analyze,`analyze` API>> is an invaluable tool for viewing the
terms produced by an analyzer. A built-in analyzer can be specified inline in
the request:

[source,console]
-------------------------------------
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The quick brown fox."
}
-------------------------------------

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox.",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
-------------------------------------
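
The same request form works for any built-in analyzer; only the `analyzer`
name changes. As a point of comparison (a sketch; the response is omitted),
the `standard` analyzer would lowercase the terms and drop the trailing
punctuation from the same text:

[source,console]
-------------------------------------
POST _analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox."
}
-------------------------------------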

You can also test combinations of:

* A tokenizer
* Zero or more token filters
* Zero or more character filters (an example follows the response below)

[source,console]
-------------------------------------
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding" ],
  "text": "Is this déjà vu?"
}
-------------------------------------

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
-------------------------------------
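
The request above combines a tokenizer with token filters only. Character
filters can be supplied in the same request through the `char_filter`
parameter. As a minimal sketch, the built-in `html_strip` character filter
could be used to remove markup before tokenization:

[source,console]
-------------------------------------
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<p>Is this <b>déjà</b> vu?</p>"
}
-------------------------------------
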
.Positions and character offsets
*********************************************************
As can be seen from the output of the `analyze` API, analyzers not only
convert words into terms, they also record the order or relative _positions_
of each term (used for phrase queries or word proximity queries), and the
start and end _character offsets_ of each term in the original text (used for
highlighting search snippets).
*********************************************************
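
For a step-by-step view of what each part of an analyzer does to the text,
the `analyze` API also accepts an `explain` parameter, which returns a more
detailed breakdown of the tokens produced by the tokenizer and each token
filter. A minimal sketch, reusing the components from the earlier example:

[source,console]
-------------------------------------
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding" ],
  "text": "Is this déjà vu?",
  "explain": true
}
-------------------------------------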

Alternatively, a <<analysis-custom-analyzer,`custom` analyzer>> can be
referred to when running the `analyze` API on a specific index:

[source,console]
-------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": { <1>
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" <2>
      }
    }
  }
}

GET my_index/_analyze <3>
{
  "analyzer": "std_folded", <4>
  "text": "Is this déjà vu?"
}

GET my_index/_analyze <3>
{
  "field": "my_text", <5>
  "text": "Is this déjà vu?"
}
-------------------------------------

<1> Define a `custom` analyzer called `std_folded`.
<2> The field `my_text` uses the `std_folded` analyzer.
<3> To refer to this analyzer, the `analyze` API must specify the index name.
<4> Refer to the analyzer by name.
<5> Refer to the analyzer used by field `my_text`.

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
-------------------------------------