[[test-analyzer]]
=== Test an analyzer

The <<indices-analyze,`analyze` API>> is an invaluable tool for viewing the
terms produced by an analyzer. A built-in analyzer can be specified inline in
the request:

[source,console]
-------------------------------------
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The quick brown fox."
}
-------------------------------------

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox.",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
-------------------------------------

You can also test combinations of:

* A tokenizer
* Zero or more token filters
* Zero or more character filters

[source,console]
-------------------------------------
POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding" ],
  "text": "Is this déjà vu?"
}
-------------------------------------

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
-------------------------------------
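
The examples above do not use a character filter, but one can be added to the
same request. As a minimal sketch, the built-in `html_strip` character filter
could be applied to remove markup before the text reaches the tokenizer (the
sample text here is only for illustration):

[source,console]
-------------------------------------
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding" ],
  "text": "<p>Is this <em>déjà vu</em>?</p>"
}
-------------------------------------

This request should emit the same four tokens as above, with offsets that
still point into the original, unstripped text.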

.Positions and character offsets
*********************************************************

As can be seen from the output of the `analyze` API, analyzers not only
convert words into terms, but also record the order or relative _positions_
of each term (used for phrase queries or word proximity queries) and the
start and end _character offsets_ of each term in the original text (used
for highlighting search snippets).

*********************************************************
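
For example, once documents have been indexed, a `match_phrase` query relies
on those positions and the `highlight` option relies on those character
offsets. A minimal sketch, assuming an index and field like the `my_index`
and `my_text` defined in the next example, with documents already indexed:

[source,console]
-------------------------------------
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "my_text": "déjà vu" <1>
    }
  },
  "highlight": {
    "fields": {
      "my_text": {} <2>
    }
  }
}
-------------------------------------
<1> Phrase matching requires the terms to appear in adjacent positions.
<2> Highlighting uses character offsets to mark the matching snippet in the original text.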

Alternatively, a <<analysis-custom-analyzer,`custom` analyzer>> can be
referred to when running the `analyze` API on a specific index:

[source,console]
-------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": { <1>
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" <2>
      }
    }
  }
}

GET my_index/_analyze <3>
{
  "analyzer": "std_folded", <4>
  "text": "Is this déjà vu?"
}

GET my_index/_analyze <3>
{
  "field": "my_text", <5>
  "text": "Is this déjà vu?"
}
-------------------------------------

The API returns the following response:

[source,console-result]
-------------------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
-------------------------------------

<1> Define a `custom` analyzer called `std_folded`.
<2> The field `my_text` uses the `std_folded` analyzer.
<3> To refer to this analyzer, the `analyze` API must specify the index name.
<4> Refer to the analyzer by name.
<5> Refer to the analyzer used by field `my_text`.
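
When the token output alone is not enough, the `analyze` API also accepts an
`explain` parameter. Setting it to `true` adds a `detail` section to the
response that shows the token stream after the tokenizer and after each token
filter, along with extra token attributes. A minimal sketch, reusing the
`std_folded` analyzer defined above:

[source,console]
-------------------------------------
GET my_index/_analyze
{
  "analyzer": "std_folded",
  "explain": true,
  "text": "Is this déjà vu?"
}
-------------------------------------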