OpenSearch/docs/reference/analysis/testing.asciidoc

[[test-analyzer]]
=== Test an analyzer

The <<indices-analyze,`analyze` API>> is an invaluable tool for viewing the
terms produced by an analyzer. A built-in analyzer can be specified inline in
the request:

[source,console]
-------------------------------------
POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
-------------------------------------

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox.",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
-------------------------------------

You can also test combinations of:

* A tokenizer
* Zero or token filters
* Zero or more character filters

[source,console]
-------------------------------------
POST _analyze
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}
-------------------------------------

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
-------------------------------------

.Positions and character offsets
*********************************************************

As can be seen from the output of the `analyze` API, analyzers not only
convert words into terms, they also record the order or relative _positions_
of each term (used for phrase queries or word proximity queries), and the
start and end _character offsets_ of each term in the original text (used for
highlighting search snippets).

*********************************************************


Alternatively, a <<analysis-custom-analyzer,`custom` analyzer>> can be
referred to when running the `analyze` API on a specific index:

[source,console]
-------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": { <1>
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" <2>
      }
    }
  }
}

GET my_index/_analyze <3>
{
  "analyzer": "std_folded", <4>
  "text":     "Is this déjà vu?"
}

GET my_index/_analyze <3>
{
  "field": "my_text", <5>
  "text":  "Is this déjà vu?"
}
-------------------------------------

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
-------------------------------------

<1> Define a `custom` analyzer called `std_folded`.
<2> The field `my_text` uses the `std_folded` analyzer.
<3> To refer to this analyzer, the `analyze` API must specify the index name.
<4> Refer to the analyzer by name.
<5> Refer to the analyzer used by field `my_text`.